BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

¹Yale University ²Google DeepMind
* Contributed equally.

BioCoder is a challenging bioinformatics code generation benchmark for examining the capabilities of state-of-the-art large language models (LLMs).


In addition to the dataset, BioCoder also features:

1. A syntax parser for real-world projects, which can be used to extract code snippets from GitHub repositories.

2. A flexible model generation framework that automates the code generation process.

3. A scalable testing framework that combines static analysis, manually written test cases, and fuzzing, which greatly improves the efficiency of writing test cases.
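The first of these components, the syntax parser, can be approximated with Python's built-in `ast` module. The sketch below is illustrative only: the `extract_functions` helper and the `gc_content` example are ours, not BioCoder's actual parser.

```python
import ast

def extract_functions(source: str):
    """Return (name, code) pairs for every function defined in a Python source string."""
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact source text of the node
            functions.append((node.name, ast.get_source_segment(source, node)))
    return functions

example = '''
import math

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    return sum(seq.count(b) for b in "GC") / len(seq)
'''

for name, code in extract_functions(example):
    print(name)  # gc_content
```

A real pipeline would walk a cloned GitHub repository file by file and additionally record the imports, globals, and class context each function depends on.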

Abstract

Pre-trained language models like ChatGPT have significantly improved code generation. As these models scale up, there is an increasing need for their output to handle more intricate tasks. Moreover, in bioinformatics, generating functional programs poses additional notable challenges due to the amount of domain knowledge required, the need for complicated data operations, and the intricate functional dependencies between those operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. For function-level code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub, along with 253 examples from the Rosalind Project. BioCoder incorporates a fuzz-testing framework for evaluation, which we have applied to many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and the scripts required for testing are all available at this https URL.
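The fuzz-testing idea from the abstract can be sketched as differential testing: feed random inputs to a model-generated candidate and compare its output against a reference implementation. The task (GC content) and the function names below are hypothetical, not part of BioCoder's actual harness.

```python
import random

def reference_gc(seq):
    # Ground-truth implementation: fraction of G/C bases in a DNA sequence
    return (seq.count("G") + seq.count("C")) / len(seq)

def candidate_gc(seq):
    # Stand-in for a model-generated function under test
    return sum(1 for b in seq if b in "GC") / len(seq)

def fuzz(candidate, reference, trials=1000, seed=0):
    """Return a counterexample input where candidate and reference disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(1, 50)))
        if abs(candidate(seq) - reference(seq)) > 1e-9:
            return seq  # mismatch found
    return None

print(fuzz(candidate_gc, reference_gc))  # None -> no mismatch in 1000 random inputs
```

Generating many random inputs this way amortizes the cost of writing test cases by hand, which is the efficiency gain the testing framework targets.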


Example Prompts

Below are additional prompts; you can copy them and run the models in the OpenAI Playground.
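For illustration, a prompt in BioCoder's style pairs the surrounding context (imports, globals) with a summary docstring and an empty function signature for the model to complete. The task below is invented for this example, not taken from the dataset.

```python
# A hypothetical BioCoder-style completion prompt: context first, then the
# target signature and docstring; the model is asked to fill in the body.
prompt = '''import numpy as np

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Example: reverse_complement("ATGC") should return "GCAT".
    """
'''
print(prompt)
```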

Evaluation

BioCoder-Py

Note: t = temperature, top-p = nucleus-sampling cutoff, len = maximum context length (maximum generation length is 256 tokens). Only the Summary Only scores are reported.
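The Pass@k columns below are conventionally computed with the unbiased estimator introduced for HumanEval (Chen et al., 2021), where n samples are drawn per problem and c of them pass the tests; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(20, 5, 1), 3))  # 0.25
```

The per-problem estimates are then averaged over the benchmark to produce the table entries.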

| Rank | Model | Source | Date | Settings | Pass@1 | Pass@5 | Pass@10 | Pass@20 |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4 | Azure OpenAI | Mar 14, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 38.439 | 48.491 | 50.619 | 52.229 |
| 2 | gpt-3.5-turbo | Azure OpenAI | Mar 01, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 24.682 | 33.997 | 37.132 | 40.127 |
| 3 | StarCoder | BigCode (Li et al., '23) | May 09, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 4.682 | 15.225 | 21.200 | 27.166 |
| 4 | SantaCoder | BigCode (Allal et al., '22) | Dec 22, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 2.965 | 9.848 | 14.227 | 18.181 |
| 5 | InCoder-6B | Facebook AI (Fried et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 1.688 | 5.320 | 8.332 | 12.006 |
| 6 | CodeGen2-7B | Salesforce Research (Nijkamp et al., '23) | May 03, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0.860 | 2.494 | 3.962 | 6.242 |
| 7 | CodeGen-6B | Salesforce Research (Nijkamp et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0.637 | 0.637 | 0.637 | 0.637 |
| 8 | InstructCodeT5+ 16B | Salesforce Research (Wang et al., '23) | May 15, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |

BioCoder-Java

Note: t = temperature, top-p = nucleus-sampling cutoff, len = maximum context length (maximum generation length is 256 tokens). Only the Summary Only scores are reported.

| Rank | Model | Source | Date | Settings | Pass@1 | Pass@5 | Pass@10 | Pass@20 |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4 | Azure OpenAI | Mar 14, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 45.011 | 55.350 | 57.616 | 60.000 |
| 2 | gpt-3.5-turbo | Azure OpenAI | Mar 01, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 17.400 | 33.199 | 37.878 | 42.000 |
| 3 | StarCoder+ | BigCode (Li et al., '23) | May 09, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 1.300 | 5.031 | 8.042 | 12.000 |
| 4 | StarCoder | BigCode (Li et al., '23) | May 09, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 0 | 0 | 0 | 0 |
| 5 | SantaCoder | BigCode (Allal et al., '22) | Dec 22, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |
| 6 | InCoder-6B | Facebook AI (Fried et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |
| 7 | CodeGen2-7B | Salesforce Research (Nijkamp et al., '23) | May 03, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |
| 8 | CodeGen-6B | Salesforce Research (Nijkamp et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |
| 9 | InstructCodeT5+ 16B | Salesforce Research (Wang et al., '23) | May 15, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0 | 0 | 0 | 0 |

BioCoder-Rosalind

Note: t = temperature, top-p = nucleus-sampling cutoff, len = maximum context length (maximum generation length is 256 tokens). Only the Summary Only scores are reported.

| Rank | Model | Source | Date | Settings | Pass@1 | Pass@5 | Pass@10 | Pass@20 |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4 | Azure OpenAI | Mar 14, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 24.308 | 39.551 | 44.864 | 50.198 |
| 2 | gpt-3.5-turbo | Azure OpenAI | Mar 01, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 23.671 | 31.953 | 36.702 | 40.725 |
| 3 | StarCoder | BigCode (Li et al., '23) | May 09, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 0.534 | 2.042 | 3.228 | 4.743 |
| 4 | CodeGen-6B | Salesforce Research (Nijkamp et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0.692 | 2.088 | 3.055 | 3.953 |
| 5 | StarCoder+ | BigCode (Li et al., '23) | May 09, 2023 | Completion; t=0.7, top-p=0.95, len=8192 | 0.356 | 1.313 | 1.978 | 2.767 |
| 6 | SantaCoder | BigCode (Allal et al., '22) | Dec 22, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0.158 | 0.658 | 1.075 | 1.581 |
| 7 | CodeGen2-7B | Salesforce Research (Nijkamp et al., '23) | May 03, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0.059 | 0.296 | 0.593 | 1.186 |
| 8 | InstructCodeT5+ 16B | Salesforce Research (Wang et al., '23) | May 15, 2023 | Completion; t=0.7, top-p=0.95, len=2048 | 0.059 | 0.296 | 0.593 | 1.186 |
| 9 | InCoder-6B | Facebook AI (Fried et al., '22) | Nov 08, 2022 | Completion; t=0.7, top-p=0.95, len=2048 | 0.020 | 0.099 | 0.198 | 0.395 |