Merged
54 commits
e37135c
Added the pseaac encoding algorithm
satvshr Jul 7, 2025
a5f01e0
Made pseaac to a class and made the functions private, still working …
satvshr Jul 7, 2025
3773a90
Made a few readability changes
satvshr Jul 7, 2025
9b9a3da
Edited tests
satvshr Jul 8, 2025
2dfe0c7
Added pytest to tests
satvshr Jul 8, 2025
1e182d3
Added numpy style docstrings and ruff formatting
satvshr Jul 8, 2025
fc2f051
Removed AptaNet from root
satvshr Jul 9, 2025
62f6c42
Added example
satvshr Jul 9, 2025
1515efe
Made requested changes
satvshr Jul 9, 2025
75d4efb
Merge branch 'main' into issue28
satvshr Jul 10, 2025
733f908
Made requested changes and updated tests
satvshr Jul 10, 2025
04ab599
Made suggested changes
satvshr Jul 11, 2025
dc78e44
Removed lint. from pyproject, will push it as a separate PR
satvshr Jul 11, 2025
c347988
Refactored code
satvshr Jul 11, 2025
d9537f4
Added pandas as a dependancy
satvshr Jul 11, 2025
1c46c55
Renamed parent folder name to put it in the same level as AptaNet
satvshr Jul 11, 2025
7781441
Refactored code and made architecture flexible
satvshr Jul 14, 2025
e762cc8
Edited docstrings and directory structure
satvshr Jul 14, 2025
e844d4f
Merge branch 'main' into issue28
satvshr Jul 14, 2025
f9392ef
weird rename experiment
satvshr Jul 14, 2025
beb45ec
weird rename experiment pt. 2
satvshr Jul 14, 2025
d603d07
Made requested changes
satvshr Jul 14, 2025
6ecf576
Made requested changes
satvshr Jul 15, 2025
b91c511
Made requested changes
satvshr Jul 15, 2025
b2428b0
chore: dummy commit to retrigger CI
satvshr Jul 15, 2025
2982954
Added missing init file to utils
satvshr Jul 15, 2025
0b5b388
Made requested changes
satvshr Jul 16, 2025
d24c4d7
Merge branch 'main' into issue28
satvshr Jul 16, 2025
0cd72b7
Added requested changes
satvshr Jul 16, 2025
fabc7b4
Added requested changes
satvshr Jul 16, 2025
32633d3
Added info about prop groups in class docstring
satvshr Jul 16, 2025
6136c39
Removed init method description
satvshr Jul 17, 2025
88c0122
editing changes
satvshr Jul 17, 2025
b7a7349
Made requested changes
satvshr Jul 18, 2025
c14c0bb
Made requested changes
satvshr Jul 18, 2025
2b9e8b2
Made PSeAAC independent
satvshr Jul 20, 2025
fdea833
Accidentally commited x.py
satvshr Jul 20, 2025
2d06039
Made sure PSeAAC wont reference 21 (as values may be added later)
satvshr Jul 20, 2025
ed7b30f
Edited class docstrings
satvshr Jul 22, 2025
180a957
Added 2 new lines before list
satvshr Jul 22, 2025
ee267d0
Merged with main and resolved conflicts
satvshr Jul 28, 2025
c8f2ac3
Merge main
satvshr Sep 10, 2025
4e8026f
Update _features.py
satvshr Sep 10, 2025
8a4be5f
bug fixes
satvshr Sep 10, 2025
28a74bc
Docstring and var name changes
satvshr Sep 10, 2025
7a404e5
docstring changes
satvshr Sep 10, 2025
b2f87c2
code formatting and checks
satvshr Sep 10, 2025
f885c92
Update _features.py
satvshr Sep 10, 2025
b6f33b1
Merge branch 'main' into issue59
satvshr Sep 30, 2025
106febc
Split into 2 classes
satvshr Sep 30, 2025
498f4e7
added better examples
satvshr Sep 30, 2025
5021a7d
Made requested changes
satvshr Oct 2, 2025
b342807
Update _features.py
satvshr Oct 6, 2025
21ff304
Made requested changes
satvshr Oct 11, 2025
160 changes: 103 additions & 57 deletions pyaptamer/pseaac/_features.py
@@ -18,48 +18,53 @@ class PSeAAC:
selected physicochemical properties and sequence-order correlations as described in
the PseAAC model by Chou.

The PSeAAC algorithm uses 21 normalized physiochemical (NP) properties of amino
acids, which we load from a predefined matrix using `aa_props`.These 21 properties
are grouped into 7 distinct property groups, with each group containing
3 consecutive properties. Specifically, the groups are arranged in order as follows:
Group 1 includes properties 1–3, Group 2 includes properties 4–6, and so on, up to
Group 7, which includes properties 19–21. The properties in order are:


1. Hydrophobicity
2. Hydrophilicity
3. Side-chain Mass
4. Polarity
5. Molecular Weight
6. Melting Point
7. Transfer Free Energy
8. Buriability
9. Bulkiness
10. Solvation Free Energy
11. Relative Mutability
12. Residue Volume
13. Volume
14. Amino Acid Distribution
15. Hydration Number
16. Isoelectric Point
17. Compressibility
18. Chromatographic Index
19. Unfolding Entropy Change
20. Unfolding Enthalpy Change
21. Unfolding Gibbs Free Energy Change

The PSeAAC algorithm uses normalized physicochemical (NP) properties of amino
acids, loaded from a predefined matrix using `aa_props`. Properties can be grouped
in one of three ways:

- `prop_indices`: A list of property indices (0-based) to select from the 21
available properties. If None, all 21 properties are used.
- `group_props`: If provided as an integer, the selected properties are grouped
into chunks of this size (e.g., `group_props=3` groups into sets of 3).
If None, the default is groups of size 3 (7 groups for 21 properties).
- `custom_groups`: A list of lists, where each sublist contains local column
indices into the selected property matrix. This overrides all other grouping
logic.

The 21 physicochemical properties (columns) are:

0. Hydrophobicity
1. Hydrophilicity
2. Side-chain Mass
3. Polarity
4. Molecular Weight
5. Melting Point
6. Transfer Free Energy
7. Buriability
8. Bulkiness
9. Solvation Free Energy
10. Relative Mutability
11. Residue Volume
12. Volume
13. Amino Acid Distribution
14. Hydration Number
15. Isoelectric Point
16. Compressibility
17. Chromatographic Index
18. Unfolding Entropy Change
19. Unfolding Enthalpy Change
20. Unfolding Gibbs Free Energy Change

Each feature vector consists of:


- 20 normalized amino acid composition features (frequency of each standard
amino acid)
- `self.lambda_val` sequence-order correlation features based on physicochemical
similarity between residues.
These (20 + `self.lambda_val`) features are computed for each of 7 predefined
property groups, resulting in a final vector of length (20 + `self.lambda_val`) * 7.
amino acid)
- `lambda_val` sequence-order correlation features (theta values) computed
from the selected physicochemical property groups.

See `transform` method for usage.
For each property group, the above (20 + `lambda_val`) features are computed,
resulting in a final vector of length (20 + `lambda_val`) * (number of property
groups). With the default grouping there are 7 groups.

Parameters
----------
@@ -69,15 +74,19 @@ class PSeAAC:
which should be of length greater than `lambda_val`.
weight : float, optional, default=0.05
The weight factor for the sequence-order correlation features.
prop_indices : list[int] or None, optional
Indices of properties to use (0-based). If None, all 21 properties are used.
group_props : int or None, optional
Group size for selected properties. If None, defaults to groups of 3.
custom_groups : list[list[int]] or None, optional
Explicit groupings of local property indices. Overrides `group_props`.

Attributes
----------
np_matrix : np.ndarray
A 20x21 matrix of normalized physicochemical properties for the 20 standard
amino acids.
prop_groups : list of tuple
List of 7 tuples, each containing indices of 3 properties that form a property
group.
np_matrix : np.ndarray of shape (20, n_props)
Normalized property values for the 20 standard amino acids and the selected
properties.
prop_groups : list[list[int]]
Groupings of local property indices into `np_matrix`.

Methods
-------
@@ -93,28 +102,59 @@ class PSeAAC:
Example
-------
>>> from pyaptamer.pseaac import PSeAAC
>>> seq = "ACDFFKKIIKKLLMMNNPPQQQRRRRIIIIRRR"
>>> # Select only 6 properties and group into 3 groups of equal size
>>> pseaac = PSeAAC(prop_indices=[0, 1, 2, 3, 4, 5], group_props=2)
>>> # Custom grouping (4 groups)
>>> pseaac = PSeAAC(custom_groups=[[0, 1], [2, 3], [4, 5], [6, 7]])
>>> # Default: all properties, grouped into 7 groups of 3
>>> pseaac = PSeAAC()
>>> features = pseaac.transform("ACDEFGHIKLMNPQRHIKLMNPQRSTVWHIKLMNPQRSTVWY")
>>> print(features[:10])
[0.006 0.006 0.006 0.006 0.006 0.006 0.018 0.018 0.018 0.018]
"""

def __init__(self, lambda_val=30, weight=0.05):
def __init__(
self,
lambda_val=30,
weight=0.05,
prop_indices=None,
group_props=None,
custom_groups=None,
):
self.lambda_val = lambda_val
self.weight = weight

# Load normalized property matrix (20x21, rows=AA, cols=NP1-NP21)
self.np_matrix = aa_props(type="numpy", normalize=True)
# Each prop_group is a tuple of 3 columns (property indices)
self.prop_groups = [
(0, 1, 2),
(3, 4, 5),
(6, 7, 8),
(9, 10, 11),
(12, 13, 14),
(15, 16, 17),
(18, 19, 20),
]
if group_props is not None and custom_groups is not None:
raise ValueError(
"Specify only one of `group_props` or `custom_groups`, not both."
)

self.np_matrix = aa_props(
prop_indices=prop_indices, type="numpy", normalize=True
)
self._n_cols = self.np_matrix.shape[1] # The number of properties selected

if custom_groups:
self.prop_groups = custom_groups
elif group_props is None:
if self._n_cols % 3 != 0:
raise ValueError(
"Default grouping expects number of properties divisible by 3."
)
self.prop_groups = [
list(range(i, i + 3)) for i in range(0, self._n_cols, 3)
]
else:
if self._n_cols % group_props != 0:
raise ValueError(
f"Number of properties ({self._n_cols}) must be divisible by "
f"group_props ({group_props})."
)
self.prop_groups = [
list(range(i, i + group_props))
for i in range(0, self._n_cols, group_props)
]

def _normalized_aa(self, seq):
"""
@@ -179,12 +219,18 @@ def transform(self, protein_sequence):
protein_sequence : str
The input protein sequence consisting of valid amino acid characters
(A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).
lambda_val : int, default=30
The maximum distance between residues considered in the sequence-order
correlation (θ) calculations.
weight : float, default=0.05
The weight factor that balances the contribution of sequence-order
correlation features relative to amino acid composition features.

Returns
-------
np.ndarray
A 1D NumPy array of length (20 + `self.lambda_val`) * number of normalized
physiochemical (NP) property groups of amino acids (7).
physicochemical (NP) property groups of amino acids (default 7).
Each element consists of:
- 20 normalized amino acid composition features
- `self.lambda_val` normalized sequence-order correlation factors (theta
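The grouping rules added to `__init__` in this file, and the resulting feature-vector length stated in the docstring, can be sketched as a standalone function. This is a minimal illustration that mirrors the diff, not the code in pyaptamer itself: `build_prop_groups` and `feature_length` are hypothetical names introduced here, and the error messages are paraphrased.

```python
def build_prop_groups(n_cols, group_props=None, custom_groups=None):
    """Return lists of local column indices, following the rules in the diff.

    Hypothetical helper for illustration; not part of pyaptamer.
    """
    if group_props is not None and custom_groups is not None:
        raise ValueError(
            "Specify only one of `group_props` or `custom_groups`, not both."
        )
    if custom_groups:
        # Explicit groups are taken as given (copied to plain lists here).
        return [list(group) for group in custom_groups]
    # Default chunk size is 3 when no group size is requested.
    size = 3 if group_props is None else group_props
    if n_cols % size != 0:
        raise ValueError(
            f"Number of properties ({n_cols}) must be divisible by "
            f"the group size ({size})."
        )
    # Consecutive chunks: [0..size-1], [size..2*size-1], ...
    return [list(range(i, i + size)) for i in range(0, n_cols, size)]


def feature_length(n_groups, lambda_val=30):
    """Length of the final vector: (20 + lambda_val) features per group."""
    return (20 + lambda_val) * n_groups


default_groups = build_prop_groups(21)
print(len(default_groups), feature_length(len(default_groups)))  # 7 350
print(build_prop_groups(6, group_props=2))  # [[0, 1], [2, 3], [4, 5]]
```

With the defaults (21 properties, groups of 3, `lambda_val=30`) this gives 7 groups and a vector of length 350, matching the `(20 + lambda_val) * 7` figure in the docstring.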
20 changes: 15 additions & 5 deletions pyaptamer/pseaac/_props.py
@@ -5,14 +5,15 @@
import pandas as pd


def aa_props(type="numpy", normalize=True):
def aa_props(prop_indices=None, type="numpy", normalize=True):
"""
Amino acid physicochemical property matrix for PSeAAC.

This function provides a 20x21 matrix of physicochemical properties for the
20 standard amino acids. Each row corresponds to an amino acid (in the order:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), and each column
corresponds to a property (P1–P21). The properties are in the order:

- Hydrophobicity
- Hydrophilicity
- Side-chain Mass
@@ -37,6 +38,7 @@ def aa_props(type="numpy", normalize=True):

References
----------

- https://github.com/nedaemami/AptaNet/blob/main/feature_extraction.py
- Hydrophobicity values are from JACS, 1962, 84: 4240-4246. (C. Tanford)
- Hydrophilicity values are from PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods)
@@ -66,6 +68,9 @@ def aa_props(type="numpy", normalize=True):

Parameters
----------
prop_indices : list of int, optional
List of indices (0-based) of properties to include (e.g., [0, 4, 7]).
If None, returns all 21 properties.
type : {'numpy', 'pandas'}, default='numpy'
If 'pandas', returns a DataFrame with amino acid and property labels.
If 'numpy', returns a numpy array.
@@ -79,10 +84,9 @@ def aa_props(type="numpy", normalize=True):
Returns
-------
props : numpy.ndarray or pandas.DataFrame (depending on `type`)

- Rows: standard amino acids (A, C, D, ..., Y)
- Columns: physicochemical properties (P1–P21) of the standard amino acids, as
mentioned in the original implementation:
https://github.com/nedaemami/AptaNet/blob/main/feature_extraction.py
- Columns: physicochemical properties of the standard amino acids.
- Entries: raw or normalized property values depending on `normalize`.

Examples
@@ -1072,8 +1076,14 @@ def aa_props(type="numpy", normalize=True):
]
).T # shape (20, 21)

if prop_indices is not None:
props = props[:, prop_indices]
selected_names = [prop_names[i] for i in prop_indices]
else:
selected_names = prop_names

if type == "pandas":
return pd.DataFrame(props, index=aa_order, columns=prop_names)
return pd.DataFrame(props, index=aa_order, columns=selected_names)
elif type == "numpy":
return props
else:
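The column selection added to `aa_props` relies on NumPy integer-array indexing. The sketch below reproduces that step on a toy matrix to show why groups must afterwards use local indices into the selected matrix. `select_props`, the toy data, and the `P0`..`P20` labels are assumptions made for illustration; the real property values and names live in `_props.py`.

```python
import numpy as np


def select_props(props, prop_names, prop_indices=None):
    """Pick property columns and their names (hypothetical helper)."""
    if prop_indices is None:
        return props, list(prop_names)
    # Integer-array indexing keeps all rows and the listed columns,
    # in the order given, and returns a copy rather than a view.
    selected = props[:, prop_indices]
    names = [prop_names[i] for i in prop_indices]
    return selected, names


# Toy stand-in for the 20 x 21 property matrix: entry = 21 * row + column.
toy = np.arange(20 * 21, dtype=float).reshape(20, 21)
toy_names = [f"P{i}" for i in range(21)]

sub, names = select_props(toy, toy_names, [0, 4, 7])
print(sub.shape, names)  # (20, 3) ['P0', 'P4', 'P7']
# Local column 1 of `sub` is global property 4.
print(sub[0].tolist())   # [0.0, 4.0, 7.0]
```

After selection, column positions restart at 0, which is why `custom_groups` in `PSeAAC` is documented as holding local indices into the selected matrix rather than the original 0 to 20 property indices.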
5 changes: 5 additions & 0 deletions pyaptamer/pseaac/aptanet/__init__.py
@@ -0,0 +1,5 @@
"""The PSeAAC encoding algorithm"""

from pyaptamer.pseaac.aptanet._features import AptaNetPSeAAC

__all__ = ["AptaNetPSeAAC"]