Earn extra cash as a bioinformatics expert
by pushing AI's frontier

We pay bioinformatics researchers to document the pipelines they already work with. Your documented workflow helps us evaluate whether AI can handle real scientific analysis.

$150 per task

Document a pipeline you already know. Most tasks take 1 to 2 hours.

Bioinformatics pipelines

Genome assembly, variant calling, RNA-seq, metagenomics, proteomics, or any computational pipeline you work with.

Help train AI agents

Your documented workflows become benchmarks that evaluate how far AI is from handling the same analysis you do.

Overview

Your role is to describe a task within your bioinformatics pipeline, break it into steps, and provide your answer and reasoning for each step.

Choose a task

Pick a task in your bioinformatics pipeline that is sufficiently long-horizon, ideally 20 or more steps. We're looking for extensive workflows that take hours of real professional work, not question-and-answer homework problems.

Break the task into steps

A step is a verifiable, distinguishable portion of the workflow that is instrumental in solving the problem. Go through a long task you've completed and identify every distinguishable part that was critical to its completion.

You can break a task into as many steps as you want. The more specific the better, but if two steps aren't really distinguishable from each other, they should be grouped into one. Every step must be necessary for solving the problem; if skipping it doesn't change the answer, it's not a step.

Provide your answer and reasoning for each step

You've already identified each step. Now, for each one, write down what your answer was and your reasoning process behind it. What was the output? What numbers did you get? Why did you make the choices you made?

Output format

Task Description of the full task

Step 1 What this step is

Rubric Answer and reasoning

Step 2 What this step is

Rubric Answer and reasoning

Step N What this step is

Rubric Answer and reasoning
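One way to keep a submission consistent with this layout is to hold it as structured data and render it at the end. The sketch below is only an illustration: the `Step` and `Task` class names and the `render` method are hypothetical and are not part of any submission tool.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str  # what this step is
    rubric: str       # your answer and reasoning for the step


@dataclass
class Task:
    description: str  # description of the full task
    steps: list = field(default_factory=list)

    def render(self) -> str:
        """Render the task in the Task / Step N / Rubric layout shown above."""
        lines = [f"Task {self.description}", ""]
        for number, step in enumerate(self.steps, start=1):
            lines += [f"Step {number} {step.description}", ""]
            lines += [f"Rubric {step.rubric}", ""]
        return "\n".join(lines).rstrip()


task = Task(
    description="Identify stem-specific genes from sorghum RNA-seq data",
    steps=[
        Step("Filter lowly expressed genes", "TPM >= 5, with the reason for that cutoff"),
        Step("Apply log2 transform", "Log2(TPM+1), with the observed change in skewness"),
    ],
)
print(task.render())
```

Numbering the steps at render time means that inserting or merging steps never leaves a gap in the sequence.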

Steps

A step is a verifiable, distinguishable portion of the workflow that is instrumental in solving the problem.

Here's an example

Task Given RNA-seq expression data from sorghum (34,129 genes across 4 organ types and 5 developmental stages), identify genes with stem-specific expression, discover their regulatory motifs, build gene regulatory networks, and validate findings across genotypes.

Step 1 Load the expression matrix and confirm dataset dimensions: gene count, organ types, developmental stages, and replicate structure

Step 2 Run PCA to verify that samples cluster by organ type, confirming data quality

Step 3 Filter lowly expressed genes using an appropriate TPM threshold

Step 4 Exclude non-vegetative organs (panicle and peduncle) that dilute stem-specific signal

Step 5 Apply log2 transform to prepare data for correlation-based network analysis

Step 6 Run a normality test to verify the statistical assumption required by WGCNA

Step 7 Compute the Tau organ specificity index per developmental stage to classify genes by tissue preference

Step 8 Examine scale-free topology fit to select the appropriate soft-thresholding power for WGCNA

Step 9 Build weighted co-expression networks using a signed network type that preserves correlation direction

Step 10 Compute correlations between module eigengenes and organ identity to find organ-important modules

Step 11 Intersect Tau-based specificity with module membership, requiring both lines of evidence for a gene to be called organ-specific

Step 12 Run GO enrichment on organ-specific gene sets using expressed genes as background and excluding IEA annotations

Step 13 Prune redundant GO terms using the hierarchy to keep only the most specific functional categories

Step 14 Scan promoter regions for known transcription factor binding motifs

Step 15 Test motif enrichment against a background set of non-specific gene promoters

Step 16 Run de novo motif discovery and assess novelty against known motif databases

Step 17 Identify hub transcription factors from stem-important modules using gene significance thresholds

Step 18 Build directed GRN edges from TF binding site predictions and undirected edges from stem-only co-expression

Step 19 Integrate GRN layers using intersection (requiring both binding site and co-expression evidence per edge)

Step 20 Design experimental validation including both expression localization (ISH) and DNA binding (EMSA) with negative controls

Step 21 Validate stem specificity across both sweet and bioenergy sorghum genotype groups

Step 22 Build a phylogenetic tree for the hub TF family to assess evolutionary conservation across grass species

Rubric

For each step, the rubric is your answer and reasoning. It explains what the correct outcome looks like.

Step 1 Load expression matrix and confirm dataset dimensions

Rubric 34,129 genes across 99 samples from BTx623 genotype. 4 organ types (stem, leaf, root, seed) × 5 developmental stages × 4 to 5 replicates. The number of organs determines Tau specificity computation and the number of stages determines how many separate WGCNA networks to build.

Step 2 Run PCA to verify data quality

Rubric Samples cluster by organ type. Stem, leaf, and root separate cleanly on PC1/PC2, confirming the expression data captures real biological differences rather than batch effects.

Step 3 Filter lowly expressed genes

Rubric TPM >= 5, retaining approximately 12,000 to 13,000 genes out of 34,129. I chose 5 because the replicate concordance drops off sharply below this threshold, and it's the standard cutoff for this platform. Lower thresholds (1 or 2) retain too much noise; higher thresholds (10+) lose legitimate low-abundance genes.
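A filter like this can be sketched in a few lines. The rubric above does not say how the threshold is applied across samples, so this sketch assumes a gene is kept when its maximum TPM across samples reaches the cutoff; the function name and the toy values are hypothetical.

```python
TPM_THRESHOLD = 5.0  # cutoff taken from the rubric above


def filter_low_expression(tpm_by_gene, threshold=TPM_THRESHOLD):
    """Keep genes whose maximum TPM across samples is >= threshold.

    Assumption: "max across samples" is one plausible rule; a mean-based or
    per-group rule would be an equally reasonable reading of the rubric.
    """
    return {
        gene: values
        for gene, values in tpm_by_gene.items()
        if max(values) >= threshold
    }


# Toy data: gene -> TPM values, one per sample (illustrative numbers only).
toy_tpm = {
    "geneA": [0.2, 1.1, 0.0],
    "geneB": [4.9, 5.0, 12.3],
    "geneC": [7.5, 8.1, 6.9],
}
kept = filter_low_expression(toy_tpm)
print(sorted(kept))
```

Whichever rule you actually used, state it in the rubric, since it changes how many genes survive the filter.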

Step 4 Exclude non-vegetative organs

Rubric Remove panicle and peduncle samples. These reproductive/transitional organs dilute stem-specific signal and are not relevant to stem gene discovery.

Step 5 Apply log2 transform

Rubric Log2(TPM+1) reduces skewness from ~10.9 to ~0.38. WGCNA requires approximately normal data for its correlation-based network construction.
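The effect of the transform on skewness can be checked directly. This is a minimal sketch using the moment-based (population) skewness; the rubric does not say which skewness estimator was used, and the toy values are not the sorghum data.

```python
import math


def log2_tpm(values):
    """Apply log2(TPM + 1); the +1 pseudocount keeps zero counts defined."""
    return [math.log2(v + 1.0) for v in values]


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy expression values with one very highly expressed gene (illustrative).
raw = [0.0, 1.0, 3.0, 7.0, 15.0, 1023.0]
transformed = log2_tpm(raw)

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(transformed), 2))
```

Reporting the before and after values, as the rubric does, lets a reviewer confirm that the transform was applied and that it had the intended effect.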

Step 6 Run normality test

Rubric Shapiro-Wilk test after log2 transform does not reject normality for any sample (p > 0.05). WGCNA's Pearson correlations assume approximately normal distributions, so this check must pass before proceeding to network construction.

Step 7 Compute Tau organ specificity index

Rubric Tau >= 0.8 computed per stage (not pooled), because organ specificity changes across development. 0.8 is the literature-standard cutoff. 21 to 23% of genes per stage have Tau >= 0.8, yielding 3,400 to 4,000 organ-specific genes per stage with 500 to 900 being stem-specific. Pooling across stages would mask temporal dynamics.
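For reference, the Tau index for a gene measured in n organs is the sum of (1 - x_i / x_max) over the organs, divided by (n - 1); it is 0 for uniform expression and 1 for expression in a single organ. The sketch below computes it per stage for one gene. The organ order, stage names, and expression values are illustrative assumptions, not the sorghum results.

```python
ORGANS = ["stem", "leaf", "root", "seed"]
TAU_CUTOFF = 0.8  # cutoff taken from the rubric above


def tau_index(expression):
    """Tau = sum(1 - x_i / x_max) / (n - 1) over n organs."""
    n = len(expression)
    if n < 2:
        raise ValueError("Tau needs at least two organs")
    x_max = max(expression)
    if x_max <= 0:
        return 0.0  # gene not expressed anywhere: treat as non-specific
    return sum(1.0 - x / x_max for x in expression) / (n - 1)


# One gene, per-stage mean expression in ORGANS order (toy values).
gene_by_stage = {
    "juvenile": [8.0, 1.0, 1.0, 0.0],
    "vegetative": [4.0, 4.0, 3.0, 4.0],
}

for stage, values in gene_by_stage.items():
    tau = tau_index(values)
    top_organ = ORGANS[values.index(max(values))]
    specific = tau >= TAU_CUTOFF
    print(stage, round(tau, 3), top_organ, "specific" if specific else "broad")
```

Computing Tau separately for each stage, rather than on pooled values, is what allows the same gene to be called specific at one stage and broad at another.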

Step 8 Examine scale free topology fit

Rubric Select the lowest power where R² exceeds 0.85. Juvenile stage: power 14 (R² = 0.87). Vegetative: 16. Floral: 13. Anthesis: 14. Grain: 16. This is data-driven, not guessed.
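The selection rule itself is simple once the fit table exists. The sketch below assumes you already have a scale-free fit R² for each candidate power (for example, exported from your WGCNA run); only the juvenile value of 0.87 at power 14 comes from the rubric above, and the other numbers are made up for illustration.

```python
R2_CUTOFF = 0.85  # cutoff taken from the rubric above


def pick_soft_power(fit_by_power, r2_cutoff=R2_CUTOFF):
    """Return the lowest power whose scale-free fit exceeds the cutoff.

    Returns None when no candidate power passes, so the caller has to
    decide explicitly what to do instead of silently using a default.
    """
    for power in sorted(fit_by_power):
        if fit_by_power[power] > r2_cutoff:
            return power
    return None


# power -> scale-free topology fit (R squared). Only 14 -> 0.87 is from the
# rubric; the remaining entries are hypothetical.
juvenile_fit = {10: 0.71, 12: 0.80, 14: 0.87, 16: 0.90}
print(pick_soft_power(juvenile_fit))
```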

Step 9 Build co-expression networks

Rubric Signed network with Pearson correlation, producing 60 modules at the juvenile stage. I used signed because it preserves correlation direction: positively correlated genes cluster together, negatively correlated ones separate. Unsigned would mix activators and repressors in the same module, which is biologically meaningless for regulatory network construction.

Step 10 Compute module-organ correlations

Rubric Point-biserial correlation between module eigengenes and binary organ identity. Modules with positive correlation and p < 0.05 are organ-important modules. I required positive direction because we want modules that go UP in the target organ, not just any association. This identified 3 to 5 stem-important modules per stage.
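The point-biserial correlation is numerically the Pearson correlation between a continuous variable and a 0/1 variable, so it can be sketched without any special routine. The eigengene values below are toy numbers, and the p-value step is left out of this sketch; in practice it would come from a statistics library.

```python
import math


def pearson(x, y):
    """Pearson correlation; with a 0/1 vector this is the point-biserial r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# One module eigengene value per sample, and whether each sample is stem.
eigengene = [0.90, 0.80, 0.85, -0.40, -0.50, -0.60]
is_stem = [1, 1, 1, 0, 0, 0]

r = pearson(eigengene, is_stem)
# Keep only modules that go up in the target organ (positive direction).
print(round(r, 3), "candidate stem module" if r > 0 else "not a candidate")
```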

Step 11 Intersect Tau and module membership

Rubric Gene is stem-specific only if Tau >= 0.8 AND in a stem-important module. Dual evidence reduces false positives. Yields 500 to 900 stem genes per stage, down from 3,400 to 4,000 with Tau alone.
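The dual-evidence rule amounts to a filter with two conditions. The sketch below also assumes that a gene's top-expressed organ must be stem, since Tau on its own says how specific a gene is but not which organ it is specific to; the gene and module names are hypothetical.

```python
def stem_specific_genes(tau, top_organ, module, stem_modules, tau_cutoff=0.8):
    """Genes passing both lines of evidence.

    tau:          gene -> Tau value at this stage
    top_organ:    gene -> organ with the highest expression (assumed input)
    module:       gene -> co-expression module label
    stem_modules: set of module labels called stem-important
    """
    return {
        gene
        for gene, value in tau.items()
        if value >= tau_cutoff
        and top_organ.get(gene) == "stem"
        and module.get(gene) in stem_modules
    }


tau = {"g1": 0.92, "g2": 0.85, "g3": 0.40, "g4": 0.95}
top_organ = {"g1": "stem", "g2": "stem", "g3": "stem", "g4": "leaf"}
module = {"g1": "M1", "g2": "M7", "g3": "M1", "g4": "M1"}

selected = stem_specific_genes(tau, top_organ, module, stem_modules={"M1"})
print(sorted(selected))
```

Here g2 fails on module membership, g3 fails on Tau, and g4 is specific to a different organ, so only g1 is kept.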

Step 12 Run GO enrichment

Rubric Background = expressed genes only (not all annotated, which inflates significance). Exclude IEA annotations (computationally predicted, unreliable). Finds enrichment for cell wall biogenesis, carbohydrate metabolism, lignin biosynthesis.

Step 13 Prune redundant GO terms

Rubric Select most specific child terms from the GO hierarchy. For example, keep "cell wall biogenesis" but drop its parent "metabolic process" since the child already implies the parent.
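The pruning rule can be written as "drop any enriched term that is an ancestor of another enriched term". The sketch below assumes the hierarchy is available as a child-to-parents mapping; the term labels are placeholders rather than real GO identifiers.

```python
def ancestors(term, parents):
    """All ancestors of a term, following child -> parents links upward."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def prune_to_most_specific(enriched, parents):
    """Keep enriched terms that are not an ancestor of another enriched term."""
    redundant = set()
    for term in enriched:
        redundant |= ancestors(term, parents)
    return {term for term in enriched if term not in redundant}


# Placeholder hierarchy: child -> list of parents (hypothetical labels).
parents = {
    "TERM:child": ["TERM:parent"],
    "TERM:parent": ["TERM:grandparent"],
}
enriched = {"TERM:child", "TERM:grandparent", "TERM:unrelated"}

print(sorted(prune_to_most_specific(enriched, parents)))
```

A term with no enriched descendants, such as the unrelated one here, is kept even if it is itself fairly general.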

Step 14 Scan promoters for known binding motifs

Rubric FIMO with JASPAR 2020 Core Plant database, p < 1e-4, scanning 1,500 bp upstream of ATG. I chose 1,500 bp because plant regulatory elements are concentrated in the proximal promoter. Shorter windows (500 bp) miss important elements; longer ones (3,000+ bp) introduce noise from distal regions that rarely function in plants.

Step 15 Test motif enrichment against background

Rubric Fisher's exact test comparing motif frequency in stem-specific promoters vs non-specific gene promoters (p < 0.05). Background set is non-specific gene promoters because enrichment is relative: a motif present equally in all genes is not stem-specific.
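For a single motif, the test reduces to a 2×2 table of promoters with and without the motif in each set. The sketch below computes a one-sided (enrichment) p-value from the hypergeometric distribution; the rubric does not say whether a one-sided or two-sided test was used, so that choice is an assumption, and the counts are toy numbers.

```python
from math import comb


def fisher_enrichment_p(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher's exact p-value for motif enrichment.

    hits_fg / n_fg: promoters with the motif / all promoters, foreground set
    hits_bg / n_bg: the same counts for the background set
    Returns P(X >= hits_fg) when n_fg promoters are drawn at random from the
    combined pool.
    """
    total = n_fg + n_bg
    total_hits = hits_fg + hits_bg
    upper = min(n_fg, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / comb(total, n_fg)


# Toy counts: motif in 30 of 100 stem-specific promoters
# versus 50 of 500 background promoters.
p_value = fisher_enrichment_p(30, 100, 50, 500)
print(p_value < 0.05)
```

When many motifs are tested, a multiple-testing correction would normally follow; whether one was applied belongs in the rubric as well.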

Step 16 Run de novo motif discovery and assess novelty

Rubric MEME with motif length 5 to 21 bp, E-value < 0.05. Novelty assessed by PCC comparison to known motifs: PCC <= 0.49 (95th percentile of known motif self-similarity) is considered novel. 12 novel motifs found.

Step 17 Identify hub transcription factors

Rubric Gene significance > 0.8 and p < 0.05 for stem identity, restricted to genes classified as TFs in PlantTFDB (because only TFs can directly regulate other genes). 84 hub TFs identified. I checked which maintain hub status across all 5 stages: only SbTALE03 and SbTALE04 do, making them the top candidates for validation.

Step 18 Build GRN edges

Rubric Directed layer: PlantPAN4.0 TFBS predictions. Undirected layer: PCC > 0.8 on stem-only expression (not all organs, which dilutes stem-specific correlations). Using all-organ data introduces false-positive edges.

Step 19 Integrate GRN using intersection

Rubric Edge exists only if both TFBS and co-expression support it. Yields 18 edges at juvenile, 11 to 31 across stages. Union would produce hundreds of false-positive edges.
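The intersection can be sketched as keeping each directed TFBS edge only when the same gene pair is also co-expressed. Co-expression has no direction, so the pairs are compared as unordered sets; the gene names below are hypothetical.

```python
def integrate_grn(tfbs_edges, coexpressed_pairs):
    """Keep directed (tf, target) edges that also have co-expression support.

    tfbs_edges:        iterable of (tf, target) tuples, directed
    coexpressed_pairs: iterable of (gene, gene) tuples, undirected
    """
    undirected = {frozenset(pair) for pair in coexpressed_pairs}
    return {
        (tf, target)
        for tf, target in tfbs_edges
        if frozenset((tf, target)) in undirected
    }


tfbs_edges = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC")}
coexpressed_pairs = [("geneA", "TF1"), ("TF2", "geneB")]

edges = integrate_grn(tfbs_edges, coexpressed_pairs)
print(sorted(edges))
```

The direction of each surviving edge comes from the binding-site layer, and the co-expression layer only acts as a filter.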

Step 20 Design experimental validation

Rubric RNAscope ISH for expression localization (confirms stem specificity in tissue sections) plus EMSA for DNA binding (confirms TF physically binds target promoters). Include dapB negative control for ISH and 100x unlabeled competitor for EMSA.

Step 21 Validate across genotypes

Rubric Test across 10 genotypes (6 sweet + 4 bioenergy sorghum) at 3 time points. 23 genes maintain stem specificity in >= 70% of genotype-stage combinations, confirming results are not genotype-specific.
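The 70% rule can be sketched as a per-gene fraction over all genotype and time-point combinations (10 × 3 = 30 here). The input layout, a list of true/false stem-specificity calls per gene, is an assumption made for illustration, as are the gene names.

```python
MIN_FRACTION = 0.70  # cutoff taken from the rubric above


def conserved_genes(calls, min_fraction=MIN_FRACTION):
    """Genes called stem-specific in at least min_fraction of combinations.

    calls: gene -> list of booleans, one per genotype/time-point combination.
    """
    return {
        gene
        for gene, flags in calls.items()
        if flags and sum(flags) / len(flags) >= min_fraction
    }


# 30 combinations per gene: 10 genotypes x 3 time points (toy calls).
calls = {
    "geneA": [True] * 27 + [False] * 3,   # 90% of combinations
    "geneB": [True] * 15 + [False] * 15,  # 50% of combinations
    "geneC": [True] * 24 + [False] * 6,   # 80% of combinations
}
print(sorted(conserved_genes(calls)))
```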

Step 22 Build phylogenetic tree for hub TF family

Rubric MUSCLE alignment, JTT+gamma model (lowest AICc in ModelTest-NG), RAxML maximum-likelihood tree with 1,000 bootstraps. SbTALE04 shows 92% identity to maize KNOTTED1, indicating deep evolutionary conservation of stem regulatory function.
