Steps
A step is a verifiable, distinguishable portion of the workflow that is instrumental in solving the problem.
Here's an example:
Step 1 Load the expression matrix and confirm dataset dimensions — gene count, organ types, developmental stages, and replicate structure
Step 2 Run PCA to verify that samples cluster by organ type, confirming data quality
Step 3 Filter lowly expressed genes using an appropriate TPM threshold
Step 4 Exclude non-vegetative organs (panicle and peduncle) that dilute stem-specific signal
Step 5 Apply log2 transform to prepare data for correlation-based network analysis
Step 6 Run a normality test to verify the statistical assumption required by WGCNA
Step 7 Compute the Tau organ specificity index per developmental stage to classify genes by tissue preference
Step 8 Examine scale-free topology fit to select the appropriate soft-thresholding power for WGCNA
Step 9 Build weighted co-expression networks using a signed network type that preserves correlation direction
Step 10 Compute correlations between module eigengenes and organ identity to find organ-important modules
Step 11 Intersect Tau-based specificity with module membership, requiring both lines of evidence for a gene to be called organ-specific
Step 12 Run GO enrichment on organ-specific gene sets using expressed genes as background and excluding IEA annotations
Step 13 Prune redundant GO terms using the hierarchy to keep only the most specific functional categories
Step 14 Scan promoter regions for known transcription factor binding motifs
Step 15 Test motif enrichment against a background set of non-specific gene promoters
Step 16 Run de novo motif discovery and assess novelty against known motif databases
Step 17 Identify hub transcription factors from stem-important modules using gene significance thresholds
Step 18 Build directed GRN edges from TF binding site predictions and undirected edges from stem-only co-expression
Step 19 Integrate GRN layers using intersection (requiring both binding site and co-expression evidence per edge)
Step 20 Design experimental validation including both expression localization (ISH) and DNA binding (EMSA) with negative controls
Step 21 Validate stem specificity across both sweet and bioenergy sorghum genotype groups
Step 22 Build a phylogenetic tree for the hub TF family to assess evolutionary conservation across grass species
Rubric
For each step, the rubric is your answer and reasoning. It explains what the correct outcome looks like.
Step 1 Load expression matrix and confirm dataset dimensions
Rubric 34,129 genes across 99 samples from the BTx623 genotype. 4 organ types (stem, leaf, root, seed) x 5 developmental stages x 4 to 5 replicates. The number of organs determines the Tau specificity computation, and the number of stages determines how many separate WGCNA networks to build.
Step 2 Run PCA to verify data quality
Rubric Samples cluster by organ type. Stem, leaf, and root samples separate cleanly on PC1/PC2, confirming that the expression data capture real biological differences rather than batch effects.
Step 3 Filter lowly expressed genes
Rubric TPM >= 5, retaining approximately 12,000 to 13,000 genes out of 34,129. I chose 5 because replicate concordance drops off sharply below this threshold, and it's the standard cutoff for this platform. Lower thresholds (1 or 2) retain too much noise; higher ones (10+) lose legitimate low-abundance genes.
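A minimal sketch of this filter on a toy matrix. The rule used here (keep a gene when its maximum TPM across samples is at least 5) is an assumption: the rubric gives the threshold but not how it is applied across samples. Gene names and values are hypothetical.

```python
# Toy TPM matrix: gene -> TPM per sample (hypothetical names and values).
tpm = {
    "geneA": [0.2, 0.8, 1.1, 0.0],
    "geneB": [4.9, 5.0, 12.3, 7.7],
    "geneC": [30.1, 0.0, 0.4, 0.2],
}


def filter_low_expression(matrix, threshold=5.0):
    """Keep genes whose maximum TPM across samples reaches the threshold.

    Assumption: 'max across samples' is one common way to apply a TPM
    cutoff; it keeps genes that are expressed in only one organ.
    """
    return {gene: values for gene, values in matrix.items()
            if max(values) >= threshold}


kept = filter_low_expression(tpm)
print(sorted(kept))
```

A stricter rule (for example, requiring the threshold in a minimum number of replicates) would drop more genes, so the rule should be fixed before comparing retained gene counts.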
Step 4 Exclude non-vegetative organs
Rubric Remove panicle and peduncle samples. These reproductive/transitional organs dilute stem-specific signal and are not relevant to stem gene discovery.
Step 5 Apply log2 transform
Rubric Log2(TPM+1) reduces skewness from ~10.9 to ~0.38. WGCNA requires approximately normal data for its correlation-based network construction.
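The effect of the transform on skewness can be illustrated with a small sketch. The values are hypothetical and will not reproduce the ~10.9 and ~0.38 figures above; the skewness used here is the population (moment-based) definition, which is an assumption about how the figures were computed.

```python
import math


def log2_tpm(values):
    """Apply log2(TPM + 1); the +1 pseudocount keeps zeros defined."""
    return [math.log2(v + 1.0) for v in values]


def skewness(values):
    """Population skewness: third central moment / (second moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical TPM values with one highly expressed gene.
raw = [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0, 1023.0]
logged = log2_tpm(raw)
print(skewness(raw), skewness(logged))
```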
Step 6 Run normality test
Rubric Shapiro-Wilk test after log2 transform is consistent with normality in all samples (p > 0.05, so normality is not rejected). WGCNA's Pearson correlations assume approximately normal distributions, so this must pass before proceeding to network construction.
Step 7 Compute Tau organ specificity index
Rubric Tau >= 0.8 computed per stage (not pooled), because organ specificity changes across development. 0.8 is the standard cutoff in the literature. 21 to 23% of genes per stage have Tau >= 0.8, yielding 3,400 to 4,000 organ-specific genes per stage, of which 500 to 900 are stem-specific. Pooling across stages would mask temporal dynamics.
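For reference, a sketch of the Tau calculation for one gene at one stage, using the usual definition Tau = sum(1 - x_i / max(x)) / (n - 1) over n organs. The expression values are hypothetical, and treating unexpressed genes as non-specific is an assumption.

```python
def tau_index(expr):
    """Tau specificity index over one gene's per-organ expression values.

    0 means uniform expression across organs; 1 means a single organ.
    """
    n = len(expr)
    peak = max(expr)
    if n < 2 or peak <= 0:
        return 0.0  # assumption: unexpressed genes count as non-specific
    return sum(1.0 - x / peak for x in expr) / (n - 1)


# Hypothetical per-organ mean expression (log2 scale) at one stage.
stage_profile = {"stem": 8.0, "leaf": 1.0, "root": 0.5, "seed": 0.5}
tau = tau_index(list(stage_profile.values()))
top_organ = max(stage_profile, key=stage_profile.get)
print(tau, top_organ, tau >= 0.8)
```

Running this once per developmental stage, rather than on values averaged over stages, is what keeps the temporal dynamics visible.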
Step 8 Examine scale-free topology fit
Rubric Select the lowest power where R² exceeds 0.85. Juvenile stage: power 14 (R² = 0.87). Vegetative: 16. Floral: 13. Anthesis: 14. Grain: 16. This is data-driven, not guessed.
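The selection rule can be written down as a small sketch. The fit table below is hypothetical except for the juvenile entry (power 14, fit 0.87) taken from the rubric; in practice such a table would come from a scale-free fit routine such as WGCNA's pickSoftThreshold in R.

```python
def pick_soft_power(fit_by_power, r2_cutoff=0.85):
    """Return the lowest power whose scale-free fit R^2 exceeds the cutoff."""
    passing = [power for power, r2 in fit_by_power.items() if r2 > r2_cutoff]
    if not passing:
        raise ValueError("no power reaches the R^2 cutoff")
    return min(passing)


# Hypothetical fit table; only the 14 -> 0.87 entry comes from the rubric.
juvenile_fit = {10: 0.71, 12: 0.80, 13: 0.84, 14: 0.87, 16: 0.90, 18: 0.91}
print(pick_soft_power(juvenile_fit))
```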
Step 9 Build co-expression networks
Rubric Signed network with Pearson correlation, producing 60 modules at the juvenile stage. I used signed because it preserves correlation direction: positively correlated genes cluster together, negatively correlated ones separate. Unsigned would mix activators and repressors in the same module, which is biologically meaningless for regulatory network construction.
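The difference between the two network types shows up directly in the adjacency formulas. This sketch uses the standard WGCNA definitions (signed: ((1 + cor) / 2) ^ power; unsigned: |cor| ^ power) with the juvenile power of 14; the correlation values are illustrative.

```python
def signed_adjacency(cor, power):
    """Signed adjacency: strong negative correlations map to near zero."""
    return ((1.0 + cor) / 2.0) ** power


def unsigned_adjacency(cor, power):
    """Unsigned adjacency: the sign of the correlation is discarded."""
    return abs(cor) ** power


power = 14
for cor in (0.9, -0.9):
    print(cor, signed_adjacency(cor, power), unsigned_adjacency(cor, power))
```

With the unsigned form, a gene pair correlated at -0.9 is connected as strongly as a pair correlated at +0.9, which is how activators and repressors end up in the same module.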
Step 10 Compute module-organ correlations
Rubric Point-biserial correlation between module eigengenes and binary organ identity. Modules with positive correlation and p < 0.05 are organ-important modules. I required a positive direction because we want modules that go UP in the target organ, not just any association. This identified 3 to 5 stem-important modules per stage.
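A point-biserial correlation is a Pearson correlation in which one variable is coded 0/1, so it can be sketched with a plain Pearson function. The eigengene values and sample labels are hypothetical. The p-value is left out here; the t statistic shown would be compared against a t distribution with n - 2 degrees of freedom (SciPy's scipy.stats.pointbiserialr returns both r and a p-value).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical module eigengene values for 8 samples and their organs.
eigengene = [0.41, 0.38, 0.35, -0.12, -0.20, -0.25, -0.27, -0.30]
organ = ["stem", "stem", "stem", "leaf", "leaf", "root", "root", "seed"]
is_stem = [1 if o == "stem" else 0 for o in organ]

r = pearson(eigengene, is_stem)
n = len(eigengene)
t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))  # df = n - 2
print(r, t_stat, r > 0)
```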
Step 11 Intersect Tau and module membership
Rubric A gene is stem-specific only if Tau >= 0.8 AND it is in a stem-important module. Dual evidence reduces false positives. Yields 500 to 900 stem genes per stage, down from 3,400 to 4,000 with Tau alone.
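The dual-evidence rule amounts to a set intersection. In this sketch every identifier and value is a hypothetical placeholder, and calling a Tau hit "stem" when stem is the organ of peak expression is an assumption.

```python
# All identifiers and values below are hypothetical placeholders.
tau_by_gene = {"g1": 0.91, "g2": 0.85, "g3": 0.62, "g4": 0.88}
top_organ = {"g1": "stem", "g2": "stem", "g3": "stem", "g4": "leaf"}
module_of = {"g1": "M3", "g2": "M7", "g3": "M3", "g4": "M3"}
stem_modules = {"M3", "M5"}  # modules called stem-important in Step 10

# Evidence 1: Tau >= 0.8 with peak expression in stem (assumed rule).
tau_hits = {g for g, t in tau_by_gene.items()
            if t >= 0.8 and top_organ[g] == "stem"}
# Evidence 2: membership in a stem-important module.
module_hits = {g for g, m in module_of.items() if m in stem_modules}

stem_specific = tau_hits & module_hits
print(sorted(stem_specific))
```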
Step 12 Run GO enrichment
Rubric Background = expressed genes only (not all annotated genes, which inflates significance). Exclude IEA annotations (Inferred from Electronic Annotation: computationally predicted, unreliable). Finds enrichment for cell wall biogenesis, carbohydrate metabolism, and lignin biosynthesis.
Step 13 Prune redundant GO terms
Rubric Select most specific child terms from the GO hierarchy. For example, keep "cell wall biogenesis" but drop its parent "metabolic process" since the child already implies the parent.
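One way to express the pruning rule is to drop every enriched term that is an ancestor of another enriched term. The parent map below is a toy structure that mirrors the example above, not the real GO graph.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def most_specific(enriched, parents):
    """Keep enriched terms that are not ancestors of other enriched terms."""
    implied = set()
    for term in enriched:
        implied |= ancestors(term, parents)
    return {term for term in enriched if term not in implied}


# Toy parent map mirroring the example above; not the real GO graph.
parents = {
    "cell wall biogenesis": ["metabolic process"],
    "lignin biosynthesis": ["metabolic process"],
    "metabolic process": [],
}
enriched = {"cell wall biogenesis", "lignin biosynthesis", "metabolic process"}
print(sorted(most_specific(enriched, parents)))
```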
Step 14 Scan promoters for known binding motifs
Rubric FIMO with JASPAR 2020 Core Plant database, p < 1e-4, scanning 1,500 bp upstream of ATG. I chose 1,500 bp because plant regulatory elements are concentrated in the proximal promoter. Shorter windows (500 bp) miss important elements; longer ones (3,000+ bp) introduce noise from distal regions that rarely function in plants.
Step 15 Test motif enrichment against background
Rubric Fisher's exact test comparing motif frequency in stem-specific promoters vs non-specific gene promoters (p < 0.05). The background set is non-specific gene promoters because enrichment is relative: a motif present equally in all genes is not stem-specific.
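A sketch of the enrichment test on a single 2x2 table, written with the standard library only. The one-sided (enrichment) alternative is an assumption, since the rubric does not say whether the test is one- or two-sided, and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on a 2x2 table.

    a: stem-specific promoters with the motif, b: stem-specific without
    c: background promoters with the motif,    d: background without
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, k) * comb(total - col1, row1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(total, row1)


# Hypothetical counts: 30 of 100 stem-specific vs 50 of 900 background.
p = fisher_exact_greater(30, 70, 50, 850)
print(p, p < 0.05)
```

When many motifs are tested, each gets its own table, so a multiple-testing correction on the resulting p-values is worth considering.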
Step 16 Run de novo motif discovery and assess novelty
Rubric MEME with motif length 5 to 21 bp, E-value < 0.05. Novelty assessed by Pearson correlation coefficient (PCC) comparison to known motifs: a motif with PCC <= 0.49 (95th percentile of known-motif self-similarity) is considered novel. 12 novel motifs found.
Step 17 Identify hub transcription factors
Rubric Gene significance > 0.8 and p < 0.05 for stem identity, restricted to genes classified as TFs in PlantTFDB (because only TFs can directly regulate other genes). 84 hub TFs identified. I checked which of them maintain hub status across all 5 stages: only SbTALE03 and SbTALE04 do, making them the top candidates for validation.
Step 18 Build GRN edges
Rubric Directed layer: PlantPAN 4.0 TFBS predictions. Undirected layer: PCC > 0.8 on stem-only expression (not all organs, which dilutes stem-specific correlations). Using all-organ data introduces false-positive edges.
Step 19 Integrate GRN using intersection
Rubric An edge exists only if both TFBS and co-expression evidence support it. Yields 18 edges at juvenile, 11 to 31 across stages. Union would produce hundreds of false-positive edges.
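The integration rule can be sketched with sets: directed TFBS edges are kept only when the same gene pair also appears in the undirected co-expression layer. The edges are hypothetical, and gene names other than SbTALE03 and SbTALE04 are placeholders.

```python
# Hypothetical edges; target gene names are placeholders.
tfbs_edges = {("SbTALE03", "geneX"), ("SbTALE03", "geneY"),
              ("SbTALE04", "geneZ")}
coexpr_pairs = {frozenset(("SbTALE03", "geneX")),
                frozenset(("geneZ", "SbTALE04")),
                frozenset(("geneX", "geneY"))}


def integrate_intersection(directed, undirected):
    """Keep a directed TF -> target edge only if the pair is co-expressed."""
    return {(tf, target) for tf, target in directed
            if frozenset((tf, target)) in undirected}


def integrate_union(directed, undirected):
    """Keep every pair with either kind of support (direction is dropped)."""
    return {frozenset(edge) for edge in directed} | undirected


grn = integrate_intersection(tfbs_edges, coexpr_pairs)
print(sorted(grn), len(integrate_union(tfbs_edges, coexpr_pairs)))
```

The intersection keeps the direction supplied by the TFBS layer, while the union has to fall back to undirected pairs because co-expression alone carries no direction.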
Step 20 Design experimental validation
Rubric RNAscope ISH for expression localization (confirms stem specificity in tissue sections) plus EMSA for DNA binding (confirms TF physically binds target promoters). Include dapB negative control for ISH and 100x unlabeled competitor for EMSA.
Step 21 Validate across genotypes
Rubric Test across 10 genotypes (6 sweet + 4 bioenergy sorghum) at 3 time points. 23 genes maintain stem specificity in >= 70% of genotype-stage combinations, confirming results are not genotype-specific.
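As a worked check of the 70% criterion, the sketch below assumes that every genotype is sampled at every time point, giving 10 x 3 = 30 combinations, so a gene must be stem-specific in at least 21 of them. The per-combination calls are hypothetical.

```python
genotypes, time_points, min_percent = 10, 3, 70
combos = genotypes * time_points                 # assumed full design
min_hits = -(-min_percent * combos // 100)       # integer ceiling


def maintains_specificity(calls, min_percent=70):
    """calls: one boolean per genotype/time-point combination."""
    return sum(calls) * 100 >= min_percent * len(calls)


# Hypothetical gene that is stem-specific in 21 of 30 combinations.
example_calls = [True] * 21 + [False] * 9
print(combos, min_hits, maintains_specificity(example_calls))
```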
Step 22 Build phylogenetic tree for hub TF family
Rubric MUSCLE alignment, JTT+gamma model (lowest AICc in ModelTest-NG), RAxML maximum-likelihood tree with 1,000 bootstraps. SbTALE04 shows 92% identity to maize KNOTTED1, indicating deep evolutionary conservation of stem regulatory function.