Re: What I would like to see
by AC on Friday 20/Jun/2003, @01:22
|
>>So, what I think is needed is the ability to import PDF and PS files<<
Importing PS is almost impossible. PS is a full programming language. While it may be possible to extract the text, there are a million ways to position it.
PDF is a little bit easier, since it is a very restricted subset of the PS programming functionality (with some new layout abilities), but still very hard. Try Google's PDF->HTML rendering capabilities, and see how often it fails.
|
The Fine Print: The following comments are owned by whomever posted them.
|
Re: What I would like to see
by Datschge on Friday 20/Jun/2003, @02:09
|
You didn't look at KWord PDF import yet, did you?
|
Re: What I would like to see
by James Richard Tyrer on Friday 20/Jun/2003, @10:14
|
OH. $&#*, the "AC" trolls are back.
Your statement makes no sense. The fact that PS is a full programming language is irrelevant.
It is true that you can only import PS if it is being used to describe a page to be printed -- Penrose tiles and affine transform ferns might be a problem. But, if a page can be described in one representation, then it can be converted to a different representation of the same page. How do you think that GhostScript displays stuff on the screen or sends it to the printer?
Perhaps it might even be easier than Micro$oft Word (*.doc) files, since PS is fully documented and there is already a code base available as open source (GhostScript & PS2Edit).
--
JRT
|
Re: What I would like to see
by not me on Sunday 22/Jun/2003, @01:14
|
This AC is not a troll. The fact that PS is a full programming language is *very* relevant. Writing a program that figures out exactly how the text flows on the page is nearly impossible. There are an infinite number of ways of positioning the text, so you can't just figure it out from the code. You have to run the PS and look at the output. But figuring out how the text on a page is positioned and how it flows around objects is a very hard problem, on the order of speech recognition or reading text. The text might be in several columns, right-justified, wrapping around circular objects. Who knows?
|
Re: What I would like to see
by James Richard Tyrer on Sunday 22/Jun/2003, @02:09
|
> Writing a program that figures out exactly how the text flows on the page is nearly impossible.
What you are saying is true but irrelevant. IIUC, you are saying that it is impossible for a program to determine information that is not contained in the PostScript file. I believe that that is a tautology.
However it is quite possible for a program to render a PostScript file as an image. We already have one of these you know -- it is called GhostScript.
There already exist programs that import PostScript and render it as an image -- The GIMP for example. WordPerfect will import PS (EPS) and render it as an image.
How do they do this if there is a font in the PostScript file? It is quite simple, they render it. To import it as text, all that is needed is not to render it but to use the same information to import it into the WordProcessor's representation of text.
We already have much of this functionality. Have you ever used: "pstoedit", "ps2ascii" & "pstotext"? If you know the text, its characteristics and where it goes on the page, you can make a WordProcessor file with that information.
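As a rough illustration of that last step (not taken from any of the tools named above), here is a minimal Python sketch that turns positioned text fragments into lines of running text. The function name `fragments_to_lines`, the `(x, y, text)` tuple layout and the `y_tolerance` parameter are all assumptions made up for this example; it also assumes a single column and PostScript-style coordinates where y grows upward.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    Assumes one column and that y grows upward, so the largest y
    is the top of the page. Both are simplifications.
    """
    lines = []  # each entry is [baseline_y, [(x, text), ...]]
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current baseline: same line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the fragments left to right.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


fragments = [
    (150, 700, "world"),
    (72, 700, "Hello"),
    (72, 688, "Second"),
    (120, 688.5, "line"),
]
print(fragments_to_lines(fragments))
```

A grouping this naive would presumably fall apart on multi-column layouts or text wrapped around objects, which is the hard part raised earlier in the thread.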
--
JRT
|
Re: What I would like to see
by Vajsravana on Monday 23/Jun/2003, @06:49
|
> To import it as text, all that is needed is not to render it but to use the same information to import it into the WordProcessor's representation of text.
This is not "importing". This is OCR.
You see... in many PS files there is no text at all! There is just a description of where some glyphs are to be displayed on the sheet. Some of the glyphs can be text (in the WP sense), some bitmap, some pure vectors.
Extracting from this the "WordProcessor's representation of text" is nothing more or less than OCR, not much different than extracting text from a bitmap or, a better example, a CorelDRAW file.
|
Re: What I would like to see
by James Richard Tyrer on Monday 23/Jun/2003, @11:57
|
To try to interpret your nonsense:
There are two possibilities:
1. The text is represented as glyphs directly or by a name or number that references a font (either embedded or not embedded).
2. The text is a graphical representation.
In all cases that I know of, the default output of wordprocessors is #1 although some do offer #2 as an option.
Obviously, if it is #1 then the information can be imported as text, and if it is #2 then it can only be imported as a graphic.
Since you are a PostScript expert, I don't need to tell you that even if the text can not be extracted with: "ps2ascii", the PS file may still be #1.
If you had left an e-mail address, I would have sent you a sample. But, you can do it yourself. Make a document with KWord, print it to a PS file, convert it to PDF with: "ps2pdf" and import the PDF back into KWord (you need the import filter if you don't have 1.3 Beta). This might not work perfectly, but you will get text when you import it.
NOW, try: "ps2ascii" on the PS file. NO text. Open the PS file with KEdit. NO text.
Is it magic?? Or, perhaps you don't know what you are talking about. 8-D
--
JRT
|
Re: What I would like to see
by Vajsravana on Tuesday 24/Jun/2003, @06:53
|
> There are two possibilities:
>
> 1. The text is represented as glyphs directly or by a name or number that references a font (either embedded or not embedded).
>
> 2. The text is a graphical representation.
What I meant to say is that, although not common, there are programs which generate postscript as pure vector graphics, keeping the glyphs as vector shapes and discarding the character and font map that the glyphs originate from (you can call it "graphical representation", but this is often used for bitmaps).
You can see a good example of these files if you use hylafax via WHFC and analyse the PS output of various win32 programs... the difference between similar documents printed in slightly different ways is sometimes amusing.
Of course, even if the result is similar when printed or viewed, there is no means of reimporting this kind of (otherwise perfectly legitimate) file as text, but only as a useless vector image.
This discards the idea of using PS as a standard archiving format, at least if you don't limit yourself to programs that generate it "correctly".
> Obviously, if it is #1 then the information can be imported as text, and if it is #2 then it can only be imported as a graphic.
As you know, "as graphic" or "not at all" has exactly the same meaning in this context.
> Since you are a PostScript expert, I don't need to tell you that even if the text can not be extracted with: "ps2ascii", the PS file may still be #1.
>[...]
I usually don't use ps2ascii at all, I know very well how many problems it has, and I too skip directly to PDF instead, when I can.
> Is it magic?? Or, perhaps you don't know what you are talking about. 8-D
Or perhaps you did not even try to understand what I was talking about. :)
|
Re: What I would like to see
by James Richard Tyrer on Tuesday 24/Jun/2003, @10:24
|
> ... although not common, there are programs which generate postscript as pure vector graphics
So, if it is: "not common" it isn't really relevant to the question, is it?
> This discards the idea of using PS as a standard archiving format.
I didn't say that it should be used as a "standard archiving format". I have suggested, and still suggest, that PDF be used as a standard format. However, this does not mean that it would not be useful to be able to import PS files directly into your WordProcessor without having to convert them to EPS or PDF first.
> As you know, "as graphic" or "not at all" has exactly the same meaning in this context.
Well, importing them "as graphic" isn't going to get you the text, but this feature -- which is already somewhat available on WordPerfect (it requires EPS) -- still has its uses.
> Or perhaps you did not even try to understand what I was talking about. :)
Yes I did, but I could not tell if you were only talking about PS files that are totally graphic images or the PS files that appear not to contain text when they actually do.
Now that I fully understand, I can state that your reasoning is flawed. Your assertion appears to be that in some (not common) instances you will find a PS file that represents text as a graphic image and that, therefore, the ability to import PS (as text) into a WordProcessor is not a useful feature.
This is to say that, because it won't always work, there is no point in having it. This is illogical -- backwards reasoning.
--
JRT
|
Re: What I would like to see
by Arboleya on Sunday 28/Mar/2004, @04:17
|
Well, it is nearly impossible to know how the text is displayed by code analysis, but it is perfectly possible by simulating the code.
If I am not wrong, all the output in PostScript is done by a virtual pen moving on the paper, but all the fonts are stored apart. If it is like that, it is possible to know where the fonts are being displayed and in which order, just by simulating an execution of the program.
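To make the simulation idea concrete, here is a minimal Python sketch of a toy interpreter for a tiny PostScript subset (`moveto`, `rmoveto`, `findfont`, `scalefont`, `setfont`, `show`). It is nowhere near a real interpreter such as GhostScript: the function name `trace_text`, the tokenizer and the fixed per-character advance are assumptions made only for this illustration, and strings with escapes or nested parentheses are not handled.

```python
import re

# Toy tokenizer: (strings) without nesting or escapes, numbers, /names, operators.
TOKEN = re.compile(
    r"\((?P<str>[^()]*)\)"
    r"|(?P<num>-?\d+(?:\.\d+)?)"
    r"|/(?P<name>[\w-]+)"
    r"|(?P<op>[A-Za-z]+)"
)

# ASSUMPTION: every glyph advances the pen by 0.6 * font size.
# A real interpreter would look up the width of each glyph in the font.
ASSUMED_ADVANCE = 0.6


def trace_text(program):
    """Simulate the program and record (x, y, font, size, text) for each show."""
    stack = []
    x = y = 0.0          # current point (the "virtual pen")
    font, size = None, 0.0
    shown = []
    for m in TOKEN.finditer(program):
        if m.group("str") is not None:
            stack.append(m.group("str"))
        elif m.group("num") is not None:
            stack.append(float(m.group("num")))
        elif m.group("name") is not None:
            stack.append(m.group("name"))
        else:
            op = m.group("op")
            if op == "moveto":
                y = stack.pop()
                x = stack.pop()
            elif op == "rmoveto":
                dy = stack.pop()
                dx = stack.pop()
                x, y = x + dx, y + dy
            elif op == "findfont":
                pass  # toy model: the font is just its name, left on the stack
            elif op == "scalefont":
                new_size = stack.pop()
                font_name = stack.pop()
                stack.append((font_name, new_size))
            elif op == "setfont":
                font, size = stack.pop()
            elif op == "show":
                text = stack.pop()
                shown.append((x, y, font, size, text))
                x += ASSUMED_ADVANCE * size * len(text)
            else:
                raise ValueError(f"unsupported operator: {op}")
    return shown


program = """
/Helvetica findfont 10 scalefont setfont
72 700 moveto (Hello) show
0 -12 rmoveto (World) show
"""
for record in trace_text(program):
    print(record)
```

The sketch only shows that running the program yields the strings, their fonts and their positions in execution order. Turning those positions back into paragraphs and columns is a separate problem.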
|