You are reading a single comment by @chez_jay and its replies.
  • It's behind a paywall, but there's nothing special happening on the surface. If you were going to suggest scraping somehow, that won't work: I got in trouble for that earlier in the week.

  • Providing a very small fragment of source HTML, suitably edited to remove context, would really help in trying to understand exactly what the underlying issue is.

    I've just tested copying superscript HTML characters into Word (Paste Special > Unformatted text) and don't find it corrupts anything, and into Notepad++, and it just comes in as regular unformatted text.

    [edit: crossed over with your previous reply. Thing is, the "2" is part of the text content in HTML, so a method to strip it out probably needs to distinguish it by looking at the HTML formatting tags which surround it]

  • Cheers for looking, I'll grab some HTML now that I'm back at my desk.

    However, just to clarify: it doesn't corrupt anything in terms of reproducing the characters, but it keeps the note number in the running text (albeit not as superscript), which is a corruption of the original source material.

    [edit: and just saw you edited! ha - here's an example of the above anyway, as I'd already found it:

    When Mr. <span class="lineunder">Faulkner</span> delivered me your former letter (for I have since had one sent me hither by Mr. <span class="lineunder">Pope</span><a href="/item/swifjoOU0040435a1c/nts/002" title="2 [marked '1' in source]" class="notecall_nts">2</a>) I was just got up from my bed
    

    ]
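
A minimal sketch of the tag-based approach suggested earlier in the thread: drop the note number by looking at the markup around it rather than at the character itself. It assumes every footnote call is an `<a>` element carrying the class `notecall_nts`, as in the fragment above; whether that holds across the whole source is untested. `NoteCallStripper` and `strip_note_calls` are hypothetical names made up for this example, and only Python's standard-library `html.parser` is used.

```python
from html.parser import HTMLParser


class NoteCallStripper(HTMLParser):
    """Collect text content, skipping text inside footnote-call links."""

    def __init__(self, skip_class="notecall_nts"):
        super().__init__(convert_charrefs=True)
        self.skip_class = skip_class  # assumed class name, taken from the sample
        self.parts = []
        self.skipping = False  # True while inside a footnote-call <a>

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.skip_class in classes:
            self.skipping = True

    def handle_endtag(self, tag):
        # <a> elements cannot nest in valid HTML, so any </a> ends the skip.
        if tag == "a":
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def strip_note_calls(html):
    """Return the plain text of `html` without footnote-call numbers."""
    parser = NoteCallStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = (
    'sent me hither by Mr. <span class="lineunder">Pope</span>'
    '<a href="/item/swifjoOU0040435a1c/nts/002" '
    'title="2 [marked \'1\' in source]" class="notecall_nts">2</a>'
    ') I was just got up from my bed'
)
print(strip_note_calls(sample))
# prints: sent me hither by Mr. Pope) I was just got up from my bed
```

This only works on HTML you already have saved locally; it does not fetch anything. Ordinary links without the `notecall_nts` class keep their text, so a "2" that belongs to the letter itself is left alone.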
