Inline images extraction #1231

maiiabocharova · 2021-08-25T12:00:00Z

There are some small pictures in my PDFs that I am not able to extract with page.getImageList() (it returns an empty list). Can you please recommend what I can do to extract those images, maybe with another library? I really need those images.

Answered by JorjMcKie


JorjMcKie · 2021-08-25T12:09:18Z

JorjMcKie
Aug 25, 2021
Maintainer

If this happens, you

either are not looking at images, but at so-called drawings
or are looking at PDF "inline" images.

Drawings can be extracted via page.get_drawings() / page.get_cdrawings(). This is a list of dictionaries. Each dict represents a number of interconnected elementary draw commands: lines, curves, rectangles or quads.

Inline images are only contained in the internal page command source (the /Contents object(s)) - nowhere else. Method page.get_text("dict") (or "rawdict") returns all images on the page - inline and others. Each image is listed as many times as it is shown.
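As a rough illustration of the page.get_text("dict") route, here is a minimal sketch that writes every image block of a page dict to disk. It assumes the documented block layout, where image blocks have "type" equal to 1 and carry the binary in "image" and a file extension in "ext". The helper names save_image_blocks and extract_page_images are made up for this example and are not part of PyMuPDF.

```python
import os


def save_image_blocks(page_dict, out_dir, prefix="img"):
    """Write every image block of a page dict to out_dir; return the file paths."""
    paths = []
    for i, block in enumerate(page_dict["blocks"]):
        if block["type"] != 1:  # 0 = text block, 1 = image block
            continue
        # "ext" and "image" are keys of image blocks in the "dict" output
        path = os.path.join(out_dir, "%s-%d.%s" % (prefix, i, block["ext"]))
        with open(path, "wb") as f:
            f.write(block["image"])
        paths.append(path)
    return paths


def extract_page_images(pdf_path, page_number, out_dir):
    """Hypothetical convenience wrapper; needs PyMuPDF installed."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(pdf_path)
    return save_image_blocks(doc[page_number].get_text("dict"), out_dir)
```

Because the dict lists an image once per appearance, an image that is shown several times on the page will be written several times by this sketch.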

1 reply

maiiabocharova Aug 25, 2021
Author

Thank you for the reply!
In the output of page.get_text("dict") I do find what I need, as text like "\nd\n~\n\uf093\n".
page.get_cdrawings() gives me something, but can I make a PNG or get a NumPy array out of it? I want to do template matching afterwards.

JorjMcKie · 2021-08-25T14:04:50Z

JorjMcKie
Aug 25, 2021
Maintainer

page.get_text("dict")["blocks"] is a list of blocks on the page. Each one is either a text block or an image block - see the documentation for TextPage. Image blocks have block["type"] == 1. The image binary is contained in block["image"]. More info is contained in the other dict keys.

The list of drawing dicts returned by page.get_drawings() can be used to re-draw each one on some other page - see the documentation.
Each "path" dict therein has a path["rect"], which is the rectangle containing all the elementary draws in it.
You could also do a page.get_pixmap(..., clip=path["rect"]) to create an image of the path. Of course there is a risk that other things (not belonging to the path) also end up in that picture by accident.
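For the template-matching question, a pixmap made with clip=... exposes its raw pixels through pix.samples, pix.width, pix.height and pix.n (bytes per pixel). The sketch below turns such a flat buffer into rows of pixel tuples using plain Python. The function names samples_to_rows and render_clip are invented for this example, and the zoom factor is an arbitrary choice.

```python
def samples_to_rows(samples, width, height, n):
    """Split a flat pixel buffer into `height` rows of `width` pixel tuples."""
    stride = width * n  # bytes per pixel row
    if len(samples) != stride * height:
        raise ValueError("buffer size does not match width * height * n")
    rows = []
    for y in range(height):
        line = samples[y * stride:(y + 1) * stride]
        rows.append([tuple(line[x * n:(x + 1) * n]) for x in range(width)])
    return rows


def render_clip(pdf_path, page_number, rect, zoom=4):
    """Hypothetical wrapper: render one rectangle of a page; needs PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
    return samples_to_rows(pix.samples, pix.width, pix.height, pix.n)
```

If NumPy is installed, np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n) should give the same layout as an array, which is usually the more convenient input for template matching.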

10 replies

JorjMcKie Aug 25, 2021
Maintainer

😎 They are not images at all, and they are not drawings either, so I tried to select them with my cursor in a PDF viewer, and voilà: they are text!!!
Then I looked at the used fonts (page = doc[0])

>>> pprint(page.get_fonts())
[(47, 'n/a', 'Type1', 'Times-BoldItalic', 'R10', ''),
 (48, 'n/a', 'Type1', 'Helvetica', 'R11', ''),
 (49, 'ttf', 'TrueType', 'VGAVKN+DRSymb21', 'R12', ''),
 (51, 'n/a', 'Type1', 'Times-Roman', 'R14', ''),
 (45, 'n/a', 'Type1', 'Courier', 'R8', ''),
 (46, 'n/a', 'Type1', 'Times-Italic', 'R9', '')]

so these are probably characters from the DRSymb21 font. Therefore I did this:

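>>> blocks = page.get_text("dict")["blocks"]
>>> spans = []
>>> for b in blocks:
...     if b["type"] != 0:  # skip image blocks, they have no "lines"
...         continue
...     for l in b["lines"]:
...         for s in l["spans"]:
...             if s["font"] == "DRSymb21":  # keep only the symbol font
...                 spans.append(s)
...
>>> for i, s in enumerate(spans):
...     pix = page.get_pixmap(clip=s["bbox"])  # render just this span
...     pix.save("symb-%i.png" % i)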

This gave me a number of PNGs, some of them identical or nonsense, but the pictograms in question were among them.
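One possible way to cut down the identical and nonsense PNGs is to keep only one span per distinct text and to drop whitespace-only spans before rendering. This is a sketch under the assumption that the same characters in the same font look the same wherever they appear; spans set at different sizes would still render differently. The function name spans_by_font is invented for this example.

```python
def spans_by_font(page_dict, font_name):
    """Return one span per distinct non-blank text among spans set in font_name."""
    seen = set()
    unique = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # image blocks have no "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if span["font"] != font_name or not text or text in seen:
                    continue  # wrong font, blank, or already collected
                seen.add(text)
                unique.append(span)
    return unique
```

Each returned span still has its "bbox", so it can be passed to page.get_pixmap(clip=span["bbox"]) as in the session above.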

maiiabocharova Aug 25, 2021
Author

This is so smart, I would never have guessed that there is such a solution!

Thank you!!! I just can't express how grateful I am!

maiiabocharova Aug 27, 2021
Author

I am so sorry to bother you once again, but it seems that in other PDF documents the pictograms are sometimes not even text:

page_dict = pdf_doc[1].get_textpage().extractDICT()

spans = [span for block in page_dict["blocks"]
         for line in block.get("lines", []) for span in line["spans"]]
# sort top-to-bottom, then left-to-right
spans = sorted(spans, key=lambda s: (s["bbox"][1], s["bbox"][0]))

I did this and went through all the spans near the position I need, but the pictogram was not there. (In the previous document you helped me with, the text of the spans containing the pictograms was "\uf093" and '\uf07f', so Unicode characters.) But with this PDF (second page), for example,
that is not the case. I tried extracting images: page.getImageList() extracted the company logo but not the pictograms, and neither the drawings nor the text contained them.

This is such a complicated issue...

JorjMcKie Aug 27, 2021
Maintainer

This time the PDF page contains images, 257 (!!!) in total. Only one is reachable via an xref (xref 9). This is how to show them:

>>> page.clean_contents()  # required for this page: sloppy PDF creator, read the docu
>>> img=page.get_image_info(xrefs=True)
>>> len(img)
257
>>> # going to draw rects around each image, use a shape because of the large number
>>> shape=page.new_shape()
>>> for i in img:
...     shape.draw_rect(i["bbox"])
...
Point(30.479999542236328, 56.64000701904297)
Point(78.95999908447266, 306.7200012207031)
Point(78.72000122070312, 306.96002197265625)
... and so on 257 times ...

>>> shape.finish(color=(0,1,0))  # surround all boxes with green border
>>> shape.commit()  # write shape to page
>>> doc.ez_save("x.pdf")  # save under new name
>>>

Gives you this:

So except for the logo, all those images are just 1 or 2 pixels each, which together make up your pictograms. An expert at work, obviously!
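Since each pictogram is spread over many tiny images, one conceivable follow-up (not something proposed in this thread) is to merge the neighbouring image rectangles from page.get_image_info() into larger regions and render each region with page.get_pixmap(clip=...). The sketch below does the merging in plain Python. The names near, union, cluster_bboxes and render_clusters, the gap of 1 point and the zoom factor are all assumptions made for illustration.

```python
def near(a, b, gap):
    """True if rectangles (x0, y0, x1, y1) overlap or lie within `gap` of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_bboxes(bboxes, gap=1.0):
    """Merge rectangles that touch or nearly touch into larger regions."""
    clusters = []
    for box in bboxes:
        box = tuple(box)
        merged = True
        while merged:  # keep absorbing clusters until the box stops growing
            merged = False
            for i, other in enumerate(clusters):
                if near(box, other, gap):
                    box = union(box, clusters.pop(i))
                    merged = True
                    break
        clusters.append(box)
    return clusters


def render_clusters(pdf_path, page_number, gap=1.0, zoom=4):
    """Hypothetical wrapper: save one PNG per merged region; needs PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    page = fitz.open(pdf_path)[page_number]
    page.clean_contents()  # was required for the page discussed above
    boxes = [info["bbox"] for info in page.get_image_info()]
    for i, rect in enumerate(cluster_bboxes(boxes, gap)):
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=rect)
        pix.save("pictogram-%i.png" % i)
```

The logo ends up as a region of its own, and pictograms that sit closer together than the gap are merged into one region, so the gap value would need tuning per document.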

maiiabocharova Aug 27, 2021
Author

Thank you so much. So I'll need to look at other approaches to match the pictograms against the templates I have; they clearly cannot be extracted with standard approaches.

Thank you for spending time helping me!
