PDF Document TextRegion has strange structure


Post by SebastianB » Thu Sep 27, 2012 09:58 AM

Hi,

I have a problem with a PDF document. The document contains some text elements that are formatted with a specific font (e.g. Tahoma).
I am iterating over the pages and text lines, checking whether each symbol is formatted with this font, and appending the matching characters to a StringBuilder. The idea is to extract information from the document for further processing.

Here is what I see in any kind of PDF Viewer (this also includes the VintaSoft Demo Applications):
&Field1:608121 &Field64:01.07.2010 &Field3:12.286,75

I am using the following code to extract the required stuff:

var pdf = new PdfDocument(file);
var sb = new StringBuilder();
for (int iPage = 0; iPage < pdf.Pages.Count; iPage++)
{
    var page = pdf.Pages[iPage];
    foreach (var textRegionLine in page.TextRegion.Lines)
    {
        foreach (var symbol in textRegionLine.Symbols)
        {
            // Compare the symbol's font with the allowed ones and append the symbol to the StringBuilder
        }
    }
}
// Clear the cache before disposing the document, not after
pdf.ClearCache();
pdf.Dispose();

And here is what I get (just a part of the output):
&Field1:608121 &Field64:01.07.2010 &Field3:12 286 75
.
,
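
The "." and "," arriving on lines of their own suggests the separators are not grouped with the digits around them. One workaround that might be worth trying is to regroup the symbols by position instead of relying on the line structure. This is only a minimal sketch, assuming the character and position of each symbol can be read; the `Glyph` class, its fields, and `GlyphRegrouper` are hypothetical names for illustration and not VintaSoft types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text symbol; not a VintaSoft type.
public sealed class Glyph
{
    public char Character;
    public double X;        // horizontal position of the symbol
    public double Baseline; // vertical position of the symbol's baseline

    public Glyph(char character, double x, double baseline)
    {
        Character = character;
        X = x;
        Baseline = baseline;
    }
}

public static class GlyphRegrouper
{
    // Puts a glyph into the current row when its baseline is within
    // `tolerance` of the first glyph of that row, otherwise starts a new row.
    // Each row is then ordered by X and joined into a string.
    public static List<string> Regroup(IEnumerable<Glyph> glyphs, double tolerance)
    {
        var rows = new List<List<Glyph>>();
        foreach (var glyph in glyphs.OrderBy(item => item.Baseline))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Baseline - glyph.Baseline) <= tolerance)
                current.Add(glyph);
            else
                rows.Add(new List<Glyph> { glyph });
        }

        return rows
            .Select(row => new string(row.OrderBy(item => item.X)
                                         .Select(item => item.Character)
                                         .ToArray()))
            .ToList();
    }
}
```

For example, calling `GlyphRegrouper.Regroup` with the glyphs '1', '2', '.', '2' at X = 0, 10, 20, 30, where the '.' has a baseline of 100.5 and the digits 100, and a tolerance of 1.0, puts all four symbols into one row and yields "12.2". The tolerance value is a guess and would need tuning for the actual document.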

Any suggestions? Any ideas?
Thanks,
Sebastian

Re: PDF Document TextRegion has strange structure

Post by Alex » Fri Sep 28, 2012 05:23 AM

Hello Sebastian,

Could you send us a demo project that demonstrates the issue? If so, please send the project with a description of the problem to support@vintasoft.com

Best regards, Alexander
