Fix ProForma charge_state#203

Merged
levitsky merged 11 commits into
levitsky:masterfrom
mobiusklein:fix/proforma-charge
May 28, 2026

Conversation

@mobiusklein

@mobiusklein mobiusklein commented May 5, 2026

Contributor

Closes #202

This changeset introduces an at-parse-time verification of tags' charge contribution so that they can be added to the total charge state. While verifying this, I found that XLMOD's prefix was incorrectly registered and that composition loss syntax wasn't being handled properly there. I also removed some tests that were using RESID modifications, which we don't support.

This might have a wider effect: if you were previously tokenizing an invalid modification, it now fails early rather than silently waiting for the modification tag to be queried later.

One place this doesn't work yet: when you modify the modifications, the view of the local charge isn't automatically refreshed. This life cycle management might be needed in the future, at the cost of a bit more storage on the ProForma instance.

@levitsky

levitsky commented May 6, 2026

Owner

Thank you!

Looking at the example from the tests, I have a question:

In [1]: from pyteomics.proforma import *

In [2]: seq = ProForma.parse("SEQUEN[Formula:Zn1:z+2]CE")

In [4]: seq.charge_state
Out[4]: ChargeState(2, [Adduct(name='H', charge=1, count=2)])

Is the adduct name H expected here? I thought it was supposed to represent a Zn ion.

@mobiusklein

Contributor Author

The specification isn't explicit here, looping in @douweschulte. For context, this changeset adds a number of protons proportional to the positive charge of a charged modification, and the question is whether it should instead just treat the modification as the adduct. After going back and doing some side-by-side comparisons, I don't think what I have is correct either.

Quoting the spec at this time, from section 11.1

Charged formulas
Charges can be specified on chemical formulas using the following suffix ‘:z’. This allows localised charges to be defined. The mass of a charged formula is the mass of its constituents minus z times the mass of an electron.
SEQUEN[Formula:Zn1:z+2]CE
If the reader also supports the glycan composition notation, section 10.2, charged glycan compositions can be defined using the same syntax:
SEQUEN[Glycan:{C8H13N1O5Na1:z+1}1Hex2]CE
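
To make the quoted rule concrete, here is a minimal arithmetic sketch. The electron mass and the approximate monoisotopic mass of 64Zn are standard reference values typed in by hand, and `charged_formula_mass` is a hypothetical helper written for illustration, not part of pyteomics or the spec.

```python
# Rule from the quoted spec text: the mass of a charged formula is the mass of
# its constituents minus z times the mass of an electron.
ELECTRON_MASS = 0.000548579909  # Da
ZN64_MASS = 63.92914            # Da, approximate monoisotopic mass of 64Zn


def charged_formula_mass(neutral_mass, z):
    """Hypothetical helper: mass of a formula carrying a signed charge z."""
    return neutral_mass - z * ELECTRON_MASS


zn_2plus = charged_formula_mass(ZN64_MASS, 2)
print(f"Zn neutral: {ZN64_MASS:.5f} Da, Zn 2+: {zn_2plus:.5f} Da")
```

A negative z adds electron mass instead of removing it, so the same helper covers anionic formulas.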

We can rewrite the example peptide SEQUEN[Formula:Zn1:z+2]CE as SEQUENCE/[Zn:z+2] which isn't quite identical (semantics of how the charge carrier(s) assort during fragmentation is outside the scope of ProForma), but for the purposes of calculating total charge is fine. However, what happens when you have a charged formula like this made-up molecule SEQUEN[Formula:C23H19O7N10Zn1:z+2]CE? Is the whole charged formula the charge carrier, or is only part of it contributing the charge? We can't know from context. In theory, you could be precise and say SEQUEN[Formula:C23H19O7N10][Formula:Zn1:z+2]CE, which would then be close to SEQUEN[Formula:C23H19O7N10]CE/[Zn:z+2], but unless it is written as such, we cannot decompose the big formula into charged and uncharged portions.

Additionally, this strategy does not work when you have something that already has an unpaired hydrogen baked into it, like some acidic compounds. The one I'll use is the monosaccharide NeuAc, or N-acetylneuraminic acid. The more of these you have in the glycan composition, the higher your charge state is on average, because it has an acidic group on one of the backbone carbons. The whole monosaccharide isn't a charge carrier, it just carries in a proton along the way.

If we were to neutralize the formula and then specify the charge separately, I think we get SEQUEN[Formula:ZnH-2]CE/[H:z+1^2]. The H-2 term in the formula denotes two hydrogens being removed to make room for the zinc to make two bonds (or a double bond) and become neutral, and then we add two protons back in the adduct to reflect the charge state. This makes something that is actually 2 Da lighter than the other constructs (when you add the adduct mass to the total peptide mass).

>>> from pyteomics import proforma, mass
>>> prop = proforma.ProForma.parse("SEQUEN[Formula:Zn1:z+2]CE")
>>> prop.charge_state
ChargeState(2, [Adduct(name='H', charge=1, count=0)])
>>> adducted = proforma.ProForma.parse("SEQUENCE/[Zn:z+2]")
>>> adducted.mass
988.2346971677199
>>> prop.mass
1052.16383936772
>>> prop.mass - mass.Composition("Zn").mass()
988.2346971677199
>>> adducted.mz(), adducted.mass
(527.08919615063, 988.2346971677199)
>>> neut = proforma.ProForma.parse("SEQUEN[Formula:Zn1H-2]CE/2")
>>> neut.mz(), neut.mass
(527.08919615063, 1050.14818930358)
>>> prop.mass, adducted.mass + adducted.charge_state.for_mz_calculation()[0], neut.mass
(1052.16383936772, 1052.16383936772, 1050.14818930358)
>>> prop.mz(), adducted.mz(), neut.mz()
(527.08919615063, 527.08919615063, 527.08919615063)

What's the right answer? I'm not sure yet. I'll think more about this after I get home.

@levitsky

Owner

The mass of a charged formula is the mass of its constituents minus z times the mass of an electron.

To me this reads that implicit protons should not be added to the charged formula.

Indeed, there doesn't seem to be a way to tell the charged part of the formula, so the only thing to do seems to be to designate it all as the adduct.
It would also make sense if the adducted notation and the localized formula were equivalent at least in terms of m/z (I can see that the neutral mass doesn't include the adduct and it makes sense).

@mobiusklein

Contributor Author
def neutral_mass(mz, z, charge_carrier=PROTON):
    return (mz * abs(z)) - (z * charge_carrier)


def mass_charge_ratio(neutral_mass, z, charge_carrier=PROTON):
    return (neutral_mass + (z * charge_carrier)) / abs(z)

The equivalent modification, UNIMOD:Cation:Zn[ii], has a neutral mass formula of Zn1H-2, which implies that it will be a 2+ ion. This sloppily works out: when you don't know the mass of the charge carrier, you assume it's always H+, so when you push the m/z and the charge through neutral_mass() above you subtract charge * PROTON Da from the total, leaving you with the true neutral mass plus the mass of the charge carrier sans protons. It really is off by the mass of 2 electrons since H+ is not the same mass as H, but Unimod is only "good enough", not perfect.

I think I am taking a shortcut during the neutral mass calculation because the proton mass constant already factors in the loss of an electron. Iff a charged formula is not supposed to be part of the neutral mass of the peptidoform, then it'd be rolled into the adduct and this'd be fine. If it were, which was what I had in mind when I suggested this feature a few years ago, then it doesn't make sense to have its mass differ from the neutralized version (subtract an equal number of protons). This part is where I think the spec might be imprecise, @douweschulte.

If we follow the same logic here, I think that implies you must subtract the protons from the formula, and then add them back in as adducts OR include the lost protons as part of the charged formula, though these might be implementation details of how I built this in pyteomics.

@levitsky

Owner

Something in this reasoning still evades me, to be honest. My understanding is that the Zn1H-2 formula works when applied for m/z calculation of a doubly (or higher) charged peptide, because two of the (implied) protons are not actually there, replaced by a zinc ion, is that correct?
Then, if starting from the neutral peptide composition, we only need to add the zinc instead of two protons, zinc being the charge carrier. I don't understand why we would add protons and show them as adducts when the protons are actually replaced by zinc.

@levitsky

levitsky commented May 11, 2026

Owner

Some more comments in no particular order:

  • After the last commit the parser is showing a charge state with zero protons instead of two: ChargeState(2, [Adduct(name='H', charge=1, count=0)]).

  • Another question I have is how the localized charge should work with total charge: "SEQUEN[Formula:Zn1:z+2]CE/3". I would expect that the meaning of this is that the total charge is 3, two of these charges are carried by the zinc ion, so one must be a proton. The parser just adds everything together to produce ChargeState(5, [Adduct(name='H', charge=1, count=3)]).

  • The mass_charge_ratio function you cited from mass.py is actually incorrect for multiply charged carriers. Luckily it is never used with anything but protons now. It is used in ProForma.fragments and that only supports numerical charges and always adds protons.

  • The current fragments calculation would almost always be wrong with adducts or localized charges when fragment charge is not zero. If we fix the logic in mass_charge_ratio to account for the charge of the carrier, we can make fragments work with non-proton adducts by actually looking for them in self.charge_state. We would maybe want to deduce all possible adducts a fragment series can have based on the requested charge. But to really interpret a sequence with a localized charged formula correctly, we would only have to generate fragments containing the charge, and only the charges that are possible to get (e.g. I guess no singly-charged fragments for SEQUEN[Formula:Zn1:z+2]CE?).
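
One way to read "account for the charge of the carrier" is sketched below. This is an assumption for illustration, not the code in mass.py: `mz_with_carrier`, its parameters, and the hand-typed constants are all hypothetical, and it only handles a single kind of carrier at a time.

```python
PROTON_MASS = 1.00727646688     # Da
ELECTRON_MASS = 0.000548579909  # Da


def mz_with_carrier(neutral_mass, z, carrier_mass=PROTON_MASS, carrier_charge=1):
    """Hypothetical m/z for a total charge z built from identical carriers.

    carrier_mass is the mass of one carrier ion (electrons already removed);
    carrier_charge is the signed charge of one carrier.
    """
    if z == 0 or carrier_charge == 0:
        raise ValueError("z and carrier_charge must be non-zero")
    count, remainder = divmod(z, carrier_charge)
    if remainder or count < 0:
        raise ValueError("z cannot be reached with this carrier")
    # Add the mass once per carrier, not once per unit of charge.
    return (neutral_mass + count * carrier_mass) / abs(z)


peptide = 988.2346971677199            # neutral mass of SEQUENCE from the thread
zn_ion = 63.92914 - 2 * ELECTRON_MASS  # approximate mass of one Zn 2+ ion
print(mz_with_carrier(peptide, 2))                            # two protons
print(mz_with_carrier(peptide, 2, zn_ion, carrier_charge=2))  # one Zn 2+
```

With the default proton carrier this reduces to the familiar (M + z * PROTON) / |z|, so proton-only callers would see no change under this reading.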

@douweschulte


What I had in mind when working on this, but note that I do not have extensive experience in seeing this in actual spectra, is that the localised charge sets the charge of the formula as specified. So A[Formula:Zn1:z+2] has a charged zinc atom and no protons are involved. This is indeed imprecise in the sense that, for bigger molecules, the exact location of the charge is unspecified. If this location is really needed, using SMILES with the charge specified might be the best option in the future (currently unsupported according to the spec, but some people are interested, so we could think about adding it).

In my own codebase I have not really figured out the exact semantics of neutral mass when these localised charges are present but for now I just return the entire thing without the global charge carriers (so including the localised charges) but this might be something that needs further deliberation. I follow your logic Joshua that the charged zinc should bind the amino acid replacing two hydrogens that then form two protons, I am unsure if this is true in the general case, but if it is we could adopt this as the standard way of calculating neutral masses and maybe add it to the spec.

On your second comment Lev on the combination of local and global charges, we did define the global charge carrier definition to be independent of local charges. So a peptide A[Formula:Zn:z+2]AA/3 would have a total charge of +5. As /3 is just shorthand notation for /[H:z+1^3], subtracting any local charges from these global carriers is weird, so the decision was made to make the definitions additive, with the downside that the global charge of the entire definition is harder to see at a glance.
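
The additive rule described above can be written out as a short sketch. `total_charge` and its argument shapes are hypothetical and chosen only for illustration; no ProForma parsing is attempted here.

```python
def total_charge(local_charges, global_adducts):
    """Sum localized charges and global charge carriers, treated as independent.

    local_charges: iterable of signed ints, one per charged modification.
    global_adducts: iterable of (charge, count) pairs, e.g. [(1, 3)] for /3,
    which is shorthand for /[H:z+1^3].
    """
    local = sum(local_charges)
    carried = sum(charge * count for charge, count in global_adducts)
    return local + carried


# A[Formula:Zn:z+2]AA/3: one local +2 charge plus three protons
print(total_charge([2], [(1, 3)]))  # prints 5
```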

On your fourth comment Lev, if a peptide indeed has a charge state of +2 and that charge is entirely localised I would assume that only fragments containing that charge would be visible. So in your example SEQUEN[Formula:Zn1:z+2]CE only y fragments 3, 4, 5, 6, & 7 and b fragments 6 & 7, and all of these will be +2 fragments. I am unsure if this is also what happens in the real world so it might be nice to find some example spectra if they are around.
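
The fragment list in the paragraph above follows from simple index arithmetic, sketched here. `fragments_with_site` is a hypothetical helper and only considers plain b and y series.

```python
def fragments_with_site(length, site):
    """Return (b_indices, y_indices) of fragments containing a 1-based site.

    b_i spans residues 1..i and y_j spans the last j residues;
    both indices run from 1 to length - 1.
    """
    b_ions = [i for i in range(1, length) if i >= site]
    y_ions = [j for j in range(1, length) if length - j + 1 <= site]
    return b_ions, y_ions


# SEQUEN[Formula:Zn1:z+2]CE: 8 residues, charged modification on residue 6 (N)
b_ions, y_ions = fragments_with_site(8, 6)
print(b_ions, y_ions)  # [6, 7] [3, 4, 5, 6, 7]
```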

@caetera

caetera commented May 12, 2026

Contributor

A comment regarding protonation/deprotonation and localized charges. In the end, it all boils down to which specific molecular structure is implied.

For example, PEPTIDE/2 or equivalently PEPTIDE/[H:z+1^2], implies that two protons were added to the neutral molecule - these are classically localized to the N-terminus, basic amino acids, and carbonyl oxygens. Thus, an equivalent structure with localized protons (it is not real ProForma notation, since it is not possible to put a localized charge on the backbone) would be, for example, [H:z+1]-PEPT[H:z+1]IDE - a proton "sitting" on the N-terminus and another one on a backbone carbonyl somewhere in the molecule.
Here is a more precise notation (R1 and R2 are some amino acid sequences, and R3 is a side chain)
[image: structure diagram not captured]

What molecular structure is implied for SEQUE[Formula:Zn1:z+2]NCE?

  1. An addition of Zn 2+ ion to a neutral peptide molecule.
    In a more precise notation (R1 and R2 are side chains and R3 and R4 are amino acid sequences)
[image: structure diagram not captured]
  2. An addition of Zn 2+ to a deprotonated peptide (i.e., effectively a Zn salt of a peptide).
    In a more precise notation (R1 and R2 are amino acid sequences)
[image: structure diagram not captured]

The first case is likely the more natural to infer, since it is similar to PTM, i.e., the ion (defined as PTM bearing a fixed charge) "modifies" a specific amino acid. Note, however, that in many cases (including the case in question), the localization to a single amino acid is chemically unlikely - the modification "sits" on multiple positions - the charge is not really localized. Furthermore, the definition of neutral mass is also challenging. Strictly speaking, neutral mass will be the mass of the peptide itself - no charge carrier. The latter is not very useful for spectral annotation, however. If one considers the mass of the ion, i.e., m in m/z, that won't correspond to any "real" molecule, contrary to the protonation case.

The second case involves peptide deprotonation; thus, the molecule's charge is +1. Similarly, after double deprotonation of side chains, it is possible to form a salt-bridge-like structure. Note, however, that the charge again won't be localized to a single residue, and the molecule itself is neutral (undetectable in a mass spectrometer).
[image: structure diagram not captured]

Which structure is implied?

To me, it seems that the main issue here is that fixed charge notation is often ambiguous. It may be that charge localization details are out of the scope of ProForma and are better described by SMILES or a similar notation, if necessary.

@caetera

caetera commented May 12, 2026

Contributor

Some chemical modifications introduce a fixed charge localized to a specific amino acid that does not involve any ion or charge carrier (for example, https://pmc.ncbi.nlm.nih.gov/articles/PMC13047680/ Scheme 1). The accurate mass of such a modification should be the elemental formula mass minus the mass of an electron. What would be the correct way to represent these kinds of modifications in ProForma?

@mobiusklein

Contributor Author

Thank you @caetera, I didn't know to think of these different configurations. This idea was derived from the various metallic cation modifications in Unimod (https://www.unimod.org/modifications_list.php?a=search&value=1&SearchFor=Cation%3A&SearchOption=Contains&SearchField=). I spent a lot of time observing mixed protonated plus ammonium, sodium, iron, and calcium adducted peptidoforms at charges of 3-6+, but the total charge distribution of these precursor ions wasn't notably different from their purely protonated isoforms. Sometimes the cation adduct was detectable in product ions, sometimes not, on the same bond cleavage site, but the product ions with the cation on them tended to have a higher charge state (self-citation: https://pubs.acs.org/doi/suppl/10.1021/acs.jproteome.0c00051/suppl_file/pr0c00051_si_001.pdf figure S2-d, but I could be over-analyzing this case).

This idea might better be expressed via the small population of charged modifications in PSI-Mod like

https://github.com/HUPO-PSI/psi-mod-CV/blob/18013a3e6489b0e79ca9bd967bbd29334d372753/PSI-MOD.obo#L39285-L39307

Unimod's cations are all effectively amino acid salts like https://github.com/HUPO-PSI/psi-mod-CV/blob/18013a3e6489b0e79ca9bd967bbd29334d372753/PSI-MOD.obo#L25760-L25779 now that I know to look for that word, which means I must apologize for calling them sloppy.

ProForma started as a way to write down what you saw, with few semantics for what that translated to in terms of mass or m/z. If you observe a delta mass addition at a position, and you know from context that it is charged, and given the delta mass you believe it is a particular chemical composition, then writing the charged formula is reasonable. Because ProForma was originally meant to describe proteoforms, it's within scope to think about an intact protein where there are a lot of charges randomly assorting around the structure but some having higher affinity to a region. If they are "localizable" based upon observation across one or more spectra, writing down a tagged region P(EPTI)[...]IDE or an ambiguous localization PE[...#m1(0.8)]P[#m1(0.1)]TIDE would be reasonable.

My thought process for implementation was naively just your first case, charged ion + "neutral" peptide. Since the salt explanation invalidates my belief in Formula:Zn1:z+2 being equivalent to Unimod:Zn[II], it's reasonable to subtract the electrons when computing the "mass" of the composition for the peptidoform. I'll have to think more about the mass vs. neutral mass point, but I think this is going to be more of a "what does the user expect" issue.

@levitsky

Owner

Hi Joshua,

What do we need to do to arrive at a satisfactory implementation, and can I help with anything?
On the one hand, I get questions from users about release timeline, and on the other, I have time to work on this until the end of next week, when I go on vacation. Do you think it would be possible to have this finalized by then?

I am open to leaving fragments for after the release, for example, and just raising a warning there when localized charges or adducts are present.

@mobiusklein

mobiusklein commented May 22, 2026 via email

Contributor Author

…ion 11.5 of spec. Accept the assumption that charged formulae are part of the total mass of the ProForma peptide
@mobiusklein

Contributor Author

I made two relatively large changes after accepting that for most users, a modification would be expected to be part of the peptide sequence, unlike the adduct. This is consistent with the discussion, and the notion that there are charged modifications in PSI-MOD. Again, thank you for the helpful discussion. Because we aren't trying to model a "distributed" charge, just mark a position, we don't need to make a change to fragments. If one wants to model uncertainty in localization at the residue level, ProForma already has a mechanism to do that.

The first was to update the handling of adducts and numerical charge states to match the wording in section 11.5. This required a small change to the parser and allowed the ChargeState and Adduct types to wrap around FormulaModification for all the actual mass and composition work they do. This also meant that we had to update the test that confirms adduct compositions, test_composition_with_adducts, which materializes the gained/lost electrons.

The second thing was to try to fix more (but not all) API compatibility issues that might appear after upgrading. Previously, charge_state was just an int, but I combined that with adducts to create the ChargeState type. I realized a fix was needed when I tried running code written assuming charge_state was settable and behaved like a number. To reduce friction, I made ChargeState try to turn itself into a regular int in as many scenarios as I could find in the Python data model by overriding the arithmetic operators. I stopped short of adding the in-place operators (e.g. __iadd__) because it introduces a need to be able to merge two ChargeStates, not reduce something to a simpler type.
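
The idea of a charge object that reduces to a plain number in arithmetic can be sketched as follows. `IntLikeCharge` is a hypothetical stand-in written for illustration; it is not the ChargeState class from this changeset and covers only a few operators.

```python
def _plain(value):
    """Unwrap an IntLikeCharge to its integer charge; pass anything else through."""
    return value.charge if isinstance(value, IntLikeCharge) else value


class IntLikeCharge:
    """Hypothetical stand-in: keeps adducts, but arithmetic yields plain numbers."""

    def __init__(self, charge, adducts=None):
        self.charge = int(charge)
        self.adducts = list(adducts or [])

    def __int__(self):
        return self.charge

    __index__ = __int__  # allows range(z), slicing, and similar integer contexts

    def __eq__(self, other):
        return self.charge == _plain(other)

    def __hash__(self):
        return hash(self.charge)

    def __add__(self, other):
        return self.charge + _plain(other)

    __radd__ = __add__

    def __mul__(self, other):
        return self.charge * _plain(other)

    __rmul__ = __mul__

    def __truediv__(self, other):
        return self.charge / _plain(other)

    def __rtruediv__(self, other):
        return _plain(other) / self.charge


z = IntLikeCharge(2, adducts=["H", "H"])
mz = (1000.0 + z * 1.00727646688) / z  # legacy-style code that treats z as a number
print(z + 1, mz)
```

Every operator returns a plain int or float rather than a new charge object, which matches the stated choice to reduce to a simpler type instead of defining how two charge states merge. That is also why no in-place operators appear in the sketch.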

@levitsky

Owner

Thank you!

So the charged formula modifications now contribute to the charge state, but without an adduct:

In [44]: seq = ProForma.parse("SEQUEN[Formula:Zn1:z+2]CE")

In [45]: seq.charge_state
Out[45]: ChargeState(2, None)

What should it mean exactly? Right now it's hard to tell because there is no way to calculate m/z for a ProForma with this charge state, as ChargeState.adducts is expected to be iterable. Conversion to a string doesn't work for the same reason.

I see that Parser seems to create the charge state from localized charges in a quite deliberate way - ChargeState(0), then adding or subtracting the actual charge. In ChargeState.__init__, this falls precisely between the charge > 0 and charge < 0 conditions, leaving the adducts at None, which doesn't seem to be supported by the methods.
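
The gap described above can be reproduced in miniature. The classes below are a hypothetical reconstruction based only on this comment, not the actual ChargeState code; the tuple layout (name, charge, count) is likewise an assumption.

```python
class GappyCharge:
    """Hypothetical reconstruction: adducts stay None when charge == 0."""

    def __init__(self, charge, adducts=None):
        self.charge = charge
        if adducts is None:
            if charge > 0:
                adducts = [("H", 1, charge)]  # protonation
            elif charge < 0:
                adducts = [("H", 1, charge)]  # negative count: deprotonation
        self.adducts = adducts                # still None for charge == 0


class SafeCharge(GappyCharge):
    """Same logic, but a missing adduct list becomes an empty list."""

    def __init__(self, charge, adducts=None):
        super().__init__(charge, adducts)
        if self.adducts is None:
            self.adducts = []


def adduct_charge(state):
    """Sum the charge contributed by adducts; needs an iterable adduct list."""
    return sum(charge * count for _name, charge, count in state.adducts)


state = SafeCharge(0)
state.charge += 2            # localized charge added after construction
print(state.charge, adduct_charge(state))
```

Calling `adduct_charge(GappyCharge(0))` raises a TypeError because None is not iterable, which mirrors the failure mode reported for m/z calculation and string conversion.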

There was also a problem with m/z calculation for other ProForma sequences, where the protons were added twice. I tried to fix that.

@mobiusklein

Contributor Author

Thank you. I found another issue with compounding charge states as well, but that has been addressed. The tests I had written in my development notebook are now part of the unit tests as I should have made them in the first place. This module is complicated enough I should also have been testing with coverage reporting enabled too. I'll spend some time plugging those holes after this release.

My intent is that charged formulae be treated like normal modifications for mass calculation and are indistinguishable from adducts for m/z calculation. This should now work and is tested explicitly.

@levitsky

levitsky commented May 27, 2026

Owner

Thank you again.

There is a lot of room for surprise with localized charges still, like:

In [16]: seq = ProForma.parse("SEQUEN[Formula:Zn1:z+2]CE")

In [17]: seq.mz()
Out[17]: 526.0813711039506

In [18]: seq.mz(charge=2)
Out[18]: 527.0886475561111

I can see this is probably intended behavior, but adding two more protons to an already 2+ charged species should not result in this m/z. I am noting this but I don't know if there should be a warning of some sort, or whether some charge arithmetic needs to happen (i.e. add (given charge - sum of local charges) protons instead of just the given charge).
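
The two numbers above can be reproduced with plain arithmetic, which also makes the proposed alternative easy to compare. This is one reading inferred from the reported values, not the library's code: both functions are hypothetical, the constants are hand-typed, and the mass is the prop.mass value quoted earlier in the thread.

```python
PROTON_MASS = 1.00727646688     # Da
ELECTRON_MASS = 0.000548579909  # Da

mass_with_zn = 1052.16383936772  # prop.mass reported earlier in the thread
local_charge = 2                 # from [Formula:Zn1:z+2]


def mz_as_observed(mass, local_z, requested_z=None):
    """Requested protons are added, but the local charge is left out of the divisor."""
    protons = requested_z or 0
    divisor = requested_z or local_z
    return (mass - local_z * ELECTRON_MASS + protons * PROTON_MASS) / abs(divisor)


def mz_total_charge(mass, local_z, requested_z):
    """Alternative: requested_z is the total charge; add only the missing protons."""
    protons = requested_z - local_z
    return (mass - local_z * ELECTRON_MASS + protons * PROTON_MASS) / abs(requested_z)


print(mz_as_observed(mass_with_zn, local_charge))      # compare with Out[17]
print(mz_as_observed(mass_with_zn, local_charge, 2))   # compare with Out[18]
print(mz_total_charge(mass_with_zn, local_charge, 2))  # no extra protons added
```

Under the alternative, asking for charge 2 on a species that already carries a local 2+ adds no protons, so the result coincides with the no-argument case.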

As for fragments, you mentioned that they don't need to be changed; this is not entirely clear to me. They are also only calculated as (de-)protonated ions now, even if the ProForma string describes something that should not contain any protons, like the example with Zn we are working with. Singly-charged, singly-protonated fragments are the default, but presumably we would see none of those m/z in the spectrum. For the given species, I guess 1+ fragments would have to be deprotonated?
Can we do anything with that? Perhaps at least issue a warning if there are non-proton adducts or an empty list like in this case? We could then continue trying to introduce a more generalized default behavior.

@mobiusklein

Contributor Author

Issuing a warning is reasonable for both mz and fragments. There is no correct answer here, this is now about minimizing surprise for the average user.
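
A warning of this kind could look roughly like the sketch below. `warn_on_nonproton_charge`, its arguments, and the message text are hypothetical; the actual checks added to mz and fragments may differ.

```python
import warnings


def warn_on_nonproton_charge(local_charge, adducts, context="mz"):
    """Hypothetical check: warn when charge is not carried by protons alone.

    local_charge: signed int, sum of localized charges.
    adducts: iterable of (name, charge, count) tuples.
    Returns True if a warning was issued.
    """
    nonproton = [name for name, _charge, _count in adducts if name != "H"]
    if local_charge or nonproton:
        warnings.warn(
            f"{context}: localized charge ({local_charge:+d}) or non-proton "
            f"adducts ({nonproton}) present; proton-only arithmetic may not "
            "match the observed spectrum",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    issued = warn_on_nonproton_charge(2, [], context="fragments")
print(issued, len(caught))
```

Using the standard `warnings` module lets callers silence or escalate the message with the usual filters, without changing the return types of the existing methods.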

The fragments method's simple API isn't going to be able to cleanly cope with getting a different charge state on only some fragments, and we currently don't warn the user when they ask for fragments at a charge state higher than the recorded charge_state on the ProForma instance. For "completeness", without breaking the API, we can also let the user pass a ChargeState instance instead of an integer there. I don't think I'll be able to add much more before the Friday deadline.

@levitsky

Owner

Thank you for pushing this forward. I think with the warnings in place we can ship this and then maybe get more feedback, and we can and should still get back to this area.
Accepting a ChargeState in fragments makes a lot of sense. As for the charge of fragments being different, I agree it's quite tricky. One could think of filtering the fragments by charge, or of deprotonating them to bring the charge down, for example. Implementing this comprehensively seems like a lot, and probably should be based more on existing use cases.

@levitsky levitsky merged commit ca48d00 into levitsky:master May 28, 2026
15 checks passed

Development

Successfully merging this pull request may close these issues.

Backwards incompatible changes in ProForma.charge_state

4 participants