Sequences (peptide, DNA, RNA)
Peptide sequence format
Peptides can be entered using one-letter or three-letter amino acid abbreviations. A text file containing sequences should use only one type of abbreviation (only one-letter or only three-letter codes, not both). Each sequence must occupy exactly one continuous line in the text file, without spaces. Abbreviations used:
Codename: peptide
3-letter | 1-letter
Ala | A
Arg | R
Asn | N
Asp | D
Asx | B
Cys | C
Gln | Q
Glu | E
Glx | Z
Gly | G
His | H
Ile | I
Leu | L
Lys | K
Met | M
Phe | F
Pro | P
Pyl | O
Sec | U
Ser | S
Thr | T
Trp | W
Tyr | Y
Val | V
Xaa | X
Xle | J
Example
Valid files
PPPALPPKKR
APTMLPPASDFA
ProProProAlaLeuProProLysLysArg
AlaProThrMetProProProLeuProPro
Invalid files
PPPALPPKKR
AlaProThrMetProProProLeuProPro
ProProProAlaLeuProProLysLysArg
AlaProThrMetPPPLPP
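The rule illustrated by these examples can be sketched in Python. This is only an illustration, not the parser Marvin uses: the function names classify_line and is_valid_peptide_file are hypothetical, the code list uses Trp (the standard three-letter code for tryptophan, W), and the sketch assumes that empty lines and custom amino acid names are not accepted.

```python
# Three-letter and one-letter codes from the abbreviation table.
# "Trp" is the standard three-letter code for tryptophan (W).
THREE_LETTER = {
    "Ala", "Arg", "Asn", "Asp", "Asx", "Cys", "Gln", "Glu", "Glx", "Gly",
    "His", "Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Pyl", "Sec", "Ser",
    "Thr", "Trp", "Tyr", "Val", "Xaa", "Xle",
}
ONE_LETTER = set("ARNDBCQEZGHILKMFPOUSTWYVXJ")


def classify_line(line):
    """Return 'one-letter', 'three-letter', or None for one sequence line."""
    if not line:
        return None  # assumption: empty lines are rejected
    if all(ch in ONE_LETTER for ch in line):
        return "one-letter"
    # Cut the line into groups of three; a trailing shorter group never matches.
    chunks = [line[i:i + 3] for i in range(0, len(line), 3)]
    if all(chunk in THREE_LETTER for chunk in chunks):
        return "three-letter"
    return None


def is_valid_peptide_file(lines):
    """True if every line is valid and all lines use the same abbreviation type."""
    kinds = {classify_line(line) for line in lines}
    return len(kinds) == 1 and None not in kinds
```

With this sketch, a file that mixes one-letter and three-letter lines is rejected, as is a line that switches abbreviation type partway through.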
Custom amino acids
In addition to the standard amino acids that are recognized by default, it is possible to define custom amino acids with non-standard side chains or with alternative protonation states. The usual format of the dictionary file is:
Ala A [CX4H3][C@HX4H1]([NX3])C=O 3 4
Arg R [N;X3][C@@H]([CH2][CH2][CH2][N;H1X3][C;X3]([N;H2X3])=N)C=O 1 10
Asn N [#7;X3][C@@H]([CH2]C([N;H2X3])=O)[C;X3]=O 1 7
Asp D [NX3][C@@HH1]([CH2]C([OX2H1])=O)C=O 1 7
...
where the corresponding columns are:
- long (three-letter code) abbreviation
- short (one-letter code) abbreviation
- SMARTS representation of the amino acid fragment
- the position of the backbone N atom in the SMARTS string (the third atom for Ala in the first line of the example)
- the position of the backbone C atom next to the acyl oxygen (the fourth atom for Ala in the first line of the example)
The columns should be separated by tab characters.
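As an illustration of the five-column layout, a dictionary line could be read with a few lines of Python. This is a minimal sketch, not part of Marvin: the names AminoAcidEntry and parse_dict_line are hypothetical, and the sketch assumes exactly five tab-separated fields per line.

```python
from typing import NamedTuple


class AminoAcidEntry(NamedTuple):
    long_name: str   # three-letter abbreviation
    short_name: str  # one-letter abbreviation
    smarts: str      # SMARTS of the amino acid fragment
    n_index: int     # 1-based position of the backbone N in the SMARTS
    c_index: int     # 1-based position of the backbone C next to the acyl oxygen


def parse_dict_line(line):
    """Split one tab-separated dictionary line into its five columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated columns, got {len(fields)}")
    long_name, short_name, smarts, n_index, c_index = fields
    return AminoAcidEntry(long_name, short_name, smarts, int(n_index), int(c_index))


entry = parse_dict_line("Ala\tA\t[CX4H3][C@HX4H1]([NX3])C=O\t3\t4")
print(entry.long_name, entry.n_index, entry.c_index)
```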
The name of a custom amino acid abbreviation is expected to start with X, followed by one or more characters enclosed in parentheses. Allowed characters are letters, digits, and the dash character. It is advised to use the same string for both the short and the long name of the custom amino acid. Valid lines are:
X(Hcy) X(Hcy) [SX2H1][CH2][CH2][C@HH1]([NX3])C=O 5 6
X(1-foo) X(1-foo) [SX2H1][CH2][C@HH1]([NX3])C=O 4 5
X(b) X(b) [CH3][CH2][CH2][CH2][CH2][C@HH1]([NX3])C=O 7 8
...
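The naming rule for custom abbreviations can be expressed as a regular expression. The following Python check is a sketch only: the function name is hypothetical, and it assumes that "letters" means the ASCII letters A-Z and a-z and that at least one character is required between the parentheses.

```python
import re

# X, an opening parenthesis, one or more letters/digits/dashes, a closing parenthesis.
CUSTOM_NAME = re.compile(r"X\([A-Za-z0-9-]+\)")


def is_valid_custom_name(name):
    """True if the name follows the X(...) convention for custom amino acids."""
    return CUSTOM_NAME.fullmatch(name) is not None


for name in ["X(Hcy)", "X(1-foo)", "X(b)", "Hcy", "X(a b)"]:
    print(name, is_valid_custom_name(name))
```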
Since Marvin 6.2, a molecule name can also be defined in the dictionary. The name is given in the first column of the file using the molName= prefix:
molName=L-Alanine Ala A [CX4H3][C@HX4H1]([NX3])C=O 3 4
Note that the SMARTS strings representing amino acid fragments specify hydrogen counts and, in some cases, connection counts to avoid ambiguity. For example, if only the string C[C@H](N)C=O were used for alanine, it would also match many other amino acids, because they contain alanine as a substructure. The bonds in the query must also be exact; query bonds are not allowed.
To describe an aromatic custom amino acid, both the aromatic and the Kekulé form should be present in the custom_aminoacids.dict file, with the same short and long names.
Users can store their custom amino acids in the custom_aminoacids.dict file in the .chemaxon directory (UNIX) or in the user's chemaxon directory (MS Windows).
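A script that needs to locate the dictionary file could build the path as sketched below. This is an assumption-laden illustration: the function custom_dict_path is hypothetical, and the sketch assumes that the .chemaxon (UNIX) or chemaxon (MS Windows) directory sits directly under the user's home directory.

```python
import sys
from pathlib import Path


def custom_dict_path(home=None, windows=None):
    """Return the assumed location of custom_aminoacids.dict for the user."""
    home = Path.home() if home is None else Path(home)
    if windows is None:
        windows = sys.platform.startswith("win")
    # Assumption: the directory lives directly under the home directory.
    folder = "chemaxon" if windows else ".chemaxon"
    return home / folder / "custom_aminoacids.dict"


print(custom_dict_path())
```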
DNA/RNA sequence format
DNA/RNA sequences can be entered using one-letter nucleic acid abbreviations. Each sequence must occupy exactly one continuous line in the text file, without spaces. Abbreviations used:
DNA | A | C | G | T
RNA | A | C | G | U
Codename: dna, rna
Example
Valid files:
ACGTACGT
ACCCCGTGGGT
A-C-G-T-A-C-G-T
A-C-C-C-C-G-T-G-G-G-T
dA-dC-dG-dT-dA-dC-dG-dT
dA-dC-dC-dC-dC-dG-dT-dG-dG-dG-dT
Invalid files
acgtacgt
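The accepted line forms shown in these examples can be captured with a small Python check. This is a sketch under stated assumptions, not Marvin's own parser: the names are hypothetical, only the three forms shown in the examples are accepted (plain, dash-separated, and dash-separated with a d prefix), the d prefix is assumed to apply to DNA only, and the dash-separated form is assumed to be valid for RNA as well.

```python
import re


def build_pattern(letters, allow_d_prefix):
    """Build a regex for one sequence line over the given uppercase letters."""
    unit = f"[{letters}]"
    forms = [
        f"{unit}+",               # plain form, e.g. ACGTACGT
        f"{unit}(?:-{unit})+",    # dash-separated form, e.g. A-C-G-T
    ]
    if allow_d_prefix:
        forms.append(f"d{unit}(?:-d{unit})*")  # e.g. dA-dC-dG-dT
    return re.compile("|".join(f"(?:{form})" for form in forms))


DNA_LINE = build_pattern("ACGT", allow_d_prefix=True)
RNA_LINE = build_pattern("ACGU", allow_d_prefix=False)  # assumption: no prefix for RNA


def is_valid_dna_line(line):
    return DNA_LINE.fullmatch(line) is not None


def is_valid_rna_line(line):
    return RNA_LINE.fullmatch(line) is not None
```

Because the patterns list only uppercase letters, a lowercase line such as acgtacgt is rejected, in line with the invalid example.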