Coverage for pdfrw/pdfrw/objects/pdfstring.py: 85%
Shortcuts on this page
r m x toggle line displays
j k next/prev highlighted chunk
0 (zero) top of page
1 (one) first highlighted chunk
Shortcuts on this page
r m x toggle line displays
j k next/prev highlighted chunk
0 (zero) top of page
1 (one) first highlighted chunk
1# A part of pdfrw (https://github.com/pmaupin/pdfrw)
2# Copyright (C) 2006-2017 Patrick Maupin, Austin, Texas
3# 2016 James Laird-Wah, Sydney, Australia
4# MIT license -- See LICENSE.txt for details
6"""
8================================
9PdfString encoding and decoding
10================================
12Introduction
13=============
16This module handles encoding and decoding of PDF strings. PDF strings
17are described in the PDF 1.7 reference manual, mostly in chapter 3
18(sections 3.2 and 3.8) and chapter 5.
20PDF strings are used in the document structure itself, and also inside
21the stream of page contents dictionaries.
23A PDF string can represent pure binary data (e.g. for a font or an
24image), or text, or glyph indices. For Western fonts, the glyph indices
25usually correspond to ASCII, but that is not guaranteed. (When it does
26happen, it makes examination of raw PDF data a lot easier.)
28The specification defines PDF string encoding at two different levels.
29At the bottom, it defines ways to encode arbitrary bytes so that a PDF
30tokenizer can understand they are a string of some sort, and can figure
31out where the string begins and ends. (That is all the tokenizer itself
32cares about.) Above that level, if the string represents text, the
33specification defines ways to encode Unicode text into raw bytes, before
34the byte encoding is performed.
36There are two ways to do the byte encoding, and two ways to do the text
37(Unicode) encoding.
39Encoding bytes into PDF strings
40================================
42Adobe calls the two ways to encode bytes into strings "Literal strings"
43and "Hexadecimal strings."
45Literal strings
46------------------
48A literal string is delimited by ASCII parentheses ("(" and ")"), and a
49hexadecimal string is delimited by ASCII less-than and greater-than
50signs ("<" and ">").
52A literal string may encode bytes almost unmolested. The caveat is
53that if a byte has the same value as a parenthesis, it must be escaped
54so that the tokenizer knows the string is not finished. This is accomplished
55by using the ASCII backslash ("\") as an escape character. Of course,
56now any backslash appearing in the data must likewise be escaped.
58Hexadecimal strings
59---------------------
61A hexadecimal string requires twice as much space as the source data
62it represents (plus two bytes for the delimiter), simply storing each
63byte as two hexadecimal digits, most significant digit first. The spec
64allows for lower or upper case hex digits, but most PDF encoders seem
65to use upper case.
67Special cases -- Legacy systems and readability
68-----------------------------------------------
70It is possible to create a PDF document that uses 7 bit ASCII encoding,
71and it is desirable in many cases to create PDFs that are reasonably
72readable when opened in a text editor. For these reasons, the syntax
73for both literal strings and hexadecimal strings is slightly more
74complicated that the initial description above. In general, the additional
75syntax allows the following features:
77 - Making the delineation between characters, or between sections of
78 a string, apparent, and easy to see in an editor.
79 - Keeping output lines from getting too wide for some editors
80 - Keeping output lines from being so narrow that you can only see the
81 small fraction of a string at a time in an editor.
82 - Suppressing unprintable characters
83 - Restricting the output string to 7 bit ASCII
85Hexadecimal readability
86~~~~~~~~~~~~~~~~~~~~~~~
88For hexadecimal strings, only the first two bullets are relevant. The syntax
89to accomplish this is simple, allowing any ASCII whitespace to be inserted
90anywhere in the encoded hex string.
92Literal readability
93~~~~~~~~~~~~~~~~~~~
95For literal strings, all of the bullets except the first are relevant.
96The syntax has two methods to help with these goals. The first method
97is to overload the escape operator to be able to do different functions,
98and the second method can reduce the number of escapes required for
99parentheses in the normal case.
101The escape function works differently, depending on what byte follows
102the backslash. In all cases, the escaping backslash is discarded,
103and then the next character is examined:
105 - For parentheses and backslashes (and, in fact, for all characters
106 not described otherwise in this list), the character after the
107 backslash is preserved in the output.
108 - A letter from the set of "nrtbf" following a backslash is interpreted as
109 a line feed, carriage return, tab, backspace, or form-feed, respectively.
110 - One to three octal digits following the backslash indicate the
111 numeric value of the encoded byte.
112 - A carriage return, carriage return/line feed, or line feed following
113 the backslash indicates a line break that was put in for readability,
114 and that is not part of the actual data, so this is discarded.
116The second method that can be used to improve readability (and reduce space)
117in literal strings is to not escape parentheses. This only works, and is
118only allowed, when the parentheses are properly balanced. For example,
119"((Hello))" is a valid encoding for a literal string, but "((Hello)" is not;
120the latter case should be encoded "(\(Hello)"
122Encoding text into strings
123==========================
125Section 3.8.1 of the PDF specification describes text strings.
127The individual characters of a text string can all be considered to
128be Unicode; Adobe specifies two different ways to encode these characters
129into a string of bytes before further encoding the byte string as a
130literal string or a hexadecimal string.
132The first way to encode these strings is called PDFDocEncoding. This
133is mostly a one-for-one mapping of bytes into single bytes, similar to
134Latin-1. The representable character set is limited to the number of
135characters that can fit in a byte, and this encoding cannot be used
136with Unicode strings that start with the two characters making up the
137UTF-16-BE BOM.
139The second way to encode these strings is with UTF-16-BE. Text strings
140encoded with this method must start with the BOM, and although the spec
141does not appear to mandate that the resultant bytes be encoded into a
142hexadecimal string, that seems to be the canonical way to do it.
144When encoding a string into UTF-16-BE, this module always adds the BOM,
145and when decoding a string from UTF-16-BE, this module always strips
146the BOM. If a source string contains a BOM, that will remain in the
147final string after a round-trip through the encoder and decoder, as
148the goal of the encoding/decoding process is transparency.
151PDF string handling in pdfrw
152=============================
154Responsibility for handling PDF strings in the pdfrw library is shared
155between this module, the tokenizer, and the pdfwriter.
157tokenizer string handling
158--------------------------
160As far as the tokenizer and its clients such as the pdfreader are concerned,
161the PdfString class must simply be something that it can instantiate by
162passing a string, that doesn't compare equal (or throw an exception when
163compared) to other possible token strings. The tokenizer must understand
164enough about the syntax of the string to successfully find its beginning
165and end in a stream of tokens, but doesn't otherwise know or care about
166the data represented by the string.
168pdfwriter string handling
169--------------------------
171The pdfwriter knows and cares about two attributes of PdfString instances:
173 - First, PdfString objects have an 'indirect' attribute, which pdfwriter
174 uses as an indication that the object knows how to represent itself
175 correctly when output to a new PDF. (In the case of a PdfString object,
176 no work is really required, because it is already a string.)
177 - Second, the PdfString.encode() method is used as a convenience to
178 automatically convert any user-supplied strings (that didn't come
179 from PDFs) when a PDF is written out to a file.
181pdfstring handling
182-------------------
184The code in this module is designed to support those uses by the
185tokenizer and the pdfwriter, and to additionally support encoding
186and decoding of PdfString objects as a convenience for the user.
188Most users of the pdfrw library never encode or decode a PdfString,
189so it is imperative that (a) merely importing this module does not
190take a significant amount of CPU time; and (b) it is cheap for the
191tokenizer to produce a PdfString, and cheap for the pdfwriter to
192consume a PdfString -- if the tokenizer finds a string that conforms
193to the PDF specification, it will be wrapped in a PdfString object,
194and if the pdfwriter finds an object with an indirect attribute, it
195simply calls str() to ask it to format itself.
197Encoding and decoding are not actually performed very often at all,
198compared to how often tokenization and then subsequent concatenation
199by the pdfwriter are performed. In fact, versions of pdfrw prior to
2000.4 did not even support Unicode for this function. Encoding and
201decoding can also easily be performed by the user, outside of the
202library, and this might still be recommended, at least for encoding,
203if the visual appeal of encodings generated by this module is found
204lacking.
207Decoding strings
208~~~~~~~~~~~~~~~~~~~
210Decoding strings can be tricky, but is a bounded process. Each
211properly-encoded encoded string represents exactly one output string,
212with the caveat that is up to the caller of the function to know whether
213he expects a Unicode string, or just bytes.
215The caller can call PdfString.to_bytes() to get a byte string (which may
216or may not represent encoded Unicode), or may call PdfString.to_unicode()
217to get a Unicode string. Byte strings will be regular strings in Python 2,
218and b'' bytes in Python 3; Unicode strings will be regular strings in
219Python 3, and u'' unicode strings in Python 2.
221To maintain application compatibility with earlier versions of pdfrw,
222PdfString.decode() is an alias for PdfString.to_unicode().
224Encoding strings
225~~~~~~~~~~~~~~~~~~
227PdfString has three factory functions that will encode strings into
228PdfString objects:
230 - PdfString.from_bytes() accepts a byte string (regular string in Python 2
231 or b'' bytes string in Python 3) and returns a PdfString object.
232 - PdfString.from_unicode() accepts a Unicode string (u'' Unicode string in
233 Python 2 or regular string in Python 3) and returns a PdfString object.
234 - PdfString.encode() examines the type of object passed, and either
235 calls from_bytes() or from_unicode() to do the real work.
237Unlike decoding(), encoding is not (mathematically) a function.
238There are (literally) an infinite number of ways to encode any given
239source string. (Of course, most of them would be stupid, unless
240the intent is some sort of denial-of-service attack.)
242So encoding strings is either simpler than decoding, or can be made to
243be an open-ended science fair project (to create the best looking
244encoded strings).
246There are parameters to the encoding functions that allow control over
247the final encoded string, but the intention is to make the default values
248produce a reasonable encoding.
250As mentioned previously, if encoding does not do what a particular
251user needs, that user is free to write his own encoder, and then
252simply instantiate a PdfString object by passing a string to the
253default constructor, the same way that the tokenizer does it.
255However, if desirable, encoding may gradually become more capable
256over time, adding the ability to generate more aesthetically pleasing
257encoded strings.
259PDFDocString encoding and decoding
260~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
262To handle this encoding in a fairly standard way, this module registers
263an encoder and decoder for PDFDocEncoding with the codecs module.
265"""
267import re
268import codecs
269import binascii
270import itertools
271from ..py23_diffs import convert_load, convert_store
273def find_pdfdocencoding(encoding):
274 """ This function conforms to the codec module registration
275 protocol. It defers calculating data structures until
276 a pdfdocencoding encode or decode is required.
278 PDFDocEncoding is described in the PDF 1.7 reference manual.
279 """
281 if encoding != 'pdfdocencoding':
282 return
284 # Create the decoding map based on the table in section D.2 of the
285 # PDF 1.7 manual
287 # Start off with the characters with 1:1 correspondence
288 decoding_map = set(range(0x20, 0x7F)) | set(range(0xA1, 0x100))
289 decoding_map.update((0x09, 0x0A, 0x0D))
290 decoding_map.remove(0xAD)
291 decoding_map = dict((x, x) for x in decoding_map)
293 # Add in the special Unicode characters
294 decoding_map.update(zip(range(0x18, 0x20), (
295 0x02D8, 0x02C7, 0x02C6, 0x02D9, 0x02DD, 0x02DB, 0x02DA, 0x02DC)))
296 decoding_map.update(zip(range(0x80, 0x9F), (
297 0x2022, 0x2020, 0x2021, 0x2026, 0x2014, 0x2013, 0x0192, 0x2044,
298 0x2039, 0x203A, 0x2212, 0x2030, 0x201E, 0x201C, 0x201D, 0x2018,
299 0x2019, 0x201A, 0x2122, 0xFB01, 0xFB02, 0x0141, 0x0152, 0x0160,
300 0x0178, 0x017D, 0x0131, 0x0142, 0x0153, 0x0161, 0x017E)))
301 decoding_map[0xA0] = 0x20AC
303 # Make the encoding map from the decoding map
304 encoding_map = codecs.make_encoding_map(decoding_map)
306 # Not every PDF producer follows the spec, so conform to Postel's law
307 # and interpret encoded strings if at all possible. In particular, they
308 # might have nulls and form-feeds, judging by random code snippets
309 # floating around the internet.
310 decoding_map.update(((x, x) for x in range(0x18)))
312 def encode(input, errors='strict'):
313 return codecs.charmap_encode(input, errors, encoding_map)
315 def decode(input, errors='strict'):
316 return codecs.charmap_decode(input, errors, decoding_map)
318 return codecs.CodecInfo(encode, decode, name='pdfdocencoding')
320codecs.register(find_pdfdocencoding)
322class PdfString(str):
323 """ A PdfString is an encoded string. It has a decode
324 method to get the actual string data out, and there
325 is an encode class method to create such a string.
326 Like any PDF object, it could be indirect, but it
327 defaults to being a direct object.
328 """
329 indirect = False
332 # The byte order mark, and unicode that could be
333 # wrongly encoded into the byte order mark by the
334 # pdfdocencoding codec.
336 bytes_bom = codecs.BOM_UTF16_BE
337 bad_pdfdoc_prefix = bytes_bom.decode('latin-1')
339 # Used by decode_literal; filled in on first use
341 unescape_dict = None
342 unescape_func = None
344 @classmethod
345 def init_unescapes(cls):
346 """ Sets up the unescape attributes for decode_literal
347 """
348 unescape_pattern = r'\\([0-7]{1,3}|\r\n|.)'
349 unescape_func = re.compile(unescape_pattern, re.DOTALL).split
350 cls.unescape_func = unescape_func
352 unescape_dict = dict(((chr(x), chr(x)) for x in range(0x100)))
353 unescape_dict.update(zip('nrtbf', '\n\r\t\b\f'))
354 unescape_dict['\r'] = ''
355 unescape_dict['\n'] = ''
356 unescape_dict['\r\n'] = ''
357 for i in range(0o10):
358 unescape_dict['%01o' % i] = chr(i)
359 for i in range(0o100):
360 unescape_dict['%02o' % i] = chr(i)
361 for i in range(0o400):
362 unescape_dict['%03o' % i] = chr(i)
363 cls.unescape_dict = unescape_dict
364 return unescape_func
366 def decode_literal(self):
367 """ Decode a PDF literal string, which is enclosed in parentheses ()
369 Many pdfrw users never decode strings, so defer creating
370 data structures to do so until the first string is decoded.
372 Possible string escapes from the spec:
373 (PDF 1.7 Reference, section 3.2.3, page 53)
375 1. \[nrtbf\()]: simple escapes
376 2. \\d{1,3}: octal. Must be zero-padded to 3 digits
377 if followed by digit
378 3. \<end of line>: line continuation. We don't know the EOL
379 marker used in the PDF, so accept \r, \n, and \r\n.
380 4. Any other character following \ escape -- the backslash
381 is swallowed.
382 """
383 result = (self.unescape_func or self.init_unescapes())(self[1:-1])
384 if len(result) == 1:
385 return convert_store(result[0])
386 unescape_dict = self.unescape_dict
387 result[1::2] = [unescape_dict[x] for x in result[1::2]]
388 return convert_store(''.join(result))
391 def decode_hex(self):
392 """ Decode a PDF hexadecimal-encoded string, which is enclosed
393 in angle brackets <>.
394 """
395 hexstr = convert_store(''.join(self[1:-1].split()))
396 if len(hexstr) % 1: # odd number of chars indicates a truncated 0
397 hexstr += '0'
398 return binascii.unhexlify(hexstr)
401 def to_bytes(self):
402 """ Decode a PDF string to bytes. This is a convenience function
403 for user code, in that (as of pdfrw 0.3) it is never
404 actually used inside pdfrw.
405 """
406 if self.startswith('(') and self.endswith(')'):
407 return self.decode_literal()
409 elif self.startswith('<') and self.endswith('>'):
410 return self.decode_hex()
412 else:
413 raise ValueError('Invalid PDF string "%s"' % repr(self))
415 def to_unicode(self):
416 """ Decode a PDF string to a unicode string. This is a
417 convenience function for user code, in that (as of
418 pdfrw 0.3) it is never actually used inside pdfrw.
420 There are two Unicode storage methods used -- either
421 UTF16_BE, or something called PDFDocEncoding, which
422 is defined in the PDF spec. The determination of
423 which decoding method to use is done by examining the
424 first two bytes for the byte order marker.
425 """
426 raw = self.to_bytes()
428 if raw[:2] == self.bytes_bom:
429 return raw[2:].decode('utf-16-be')
430 else:
431 return raw.decode('pdfdocencoding')
433 # Legacy-compatible interface
434 decode = to_unicode
436 # Internal value used by encoding
438 escape_splitter = None # Calculated on first use
440 @classmethod
441 def init_escapes(cls):
442 """ Initialize the escape_splitter for the encode method
443 """
444 cls.escape_splitter = re.compile(br'(\(|\\|\))').split
445 return cls.escape_splitter
447 @classmethod
448 def from_bytes(cls, raw, bytes_encoding='auto'):
449 """ The from_bytes() constructor is called to encode a source raw
450 byte string into a PdfString that is suitable for inclusion
451 in a PDF.
453 NOTE: There is no magic in the encoding process. A user
454 can certainly do his own encoding, and simply initialize a
455 PdfString() instance with his encoded string. That may be
456 useful, for example, to add line breaks to make it easier
457 to load PDFs into editors, or to not bother to escape balanced
458 parentheses, or to escape additional characters to make a PDF
459 more readable in a file editor. Those are features not
460 currently supported by this method.
462 from_bytes() can use a heuristic to figure out the best
463 encoding for the string, or the user can control the process
464 by changing the bytes_encoding parameter to 'literal' or 'hex'
465 to force a particular conversion method.
466 """
468 # If hexadecimal is not being forced, then figure out how long
469 # the escaped literal string will be, and fall back to hex if
470 # it is too long.
472 force_hex = bytes_encoding == 'hex'
473 if not force_hex:
474 if bytes_encoding not in ('literal', 'auto'):
475 raise ValueError('Invalid bytes_encoding value: %s'
476 % bytes_encoding)
477 splitlist = (cls.escape_splitter or cls.init_escapes())(raw)
478 if bytes_encoding == 'auto' and len(splitlist) // 2 >= len(raw):
479 force_hex = True
481 if force_hex:
482 # The spec does not mandate uppercase,
483 # but it seems to be the convention.
484 fmt = '<%s>'
485 result = binascii.hexlify(raw).upper()
486 else:
487 fmt = '(%s)'
488 splitlist[1::2] = [(b'\\' + x) for x in splitlist[1::2]]
489 result = b''.join(splitlist)
491 return cls(fmt % convert_load(result))
493 @classmethod
494 def from_unicode(cls, source, text_encoding='auto',
495 bytes_encoding='auto'):
496 """ The from_unicode() constructor is called to encode a source
497 string into a PdfString that is suitable for inclusion in a PDF.
499 NOTE: There is no magic in the encoding process. A user
500 can certainly do his own encoding, and simply initialize a
501 PdfString() instance with his encoded string. That may be
502 useful, for example, to add line breaks to make it easier
503 to load PDFs into editors, or to not bother to escape balanced
504 parentheses, or to escape additional characters to make a PDF
505 more readable in a file editor. Those are features not
506 supported by this method.
508 from_unicode() can use a heuristic to figure out the best
509 encoding for the string, or the user can control the process
510 by changing the text_encoding parameter to 'pdfdocencoding'
511 or 'utf16', and/or by changing the bytes_encoding parameter
512 to 'literal' or 'hex' to force particular conversion methods.
514 The function will raise an exception if it cannot perform
515 the conversion as requested by the user.
516 """
518 # Give preference to pdfdocencoding, since it only
519 # requires one raw byte per character, rather than two.
520 if text_encoding != 'utf16':
521 force_pdfdoc = text_encoding == 'pdfdocencoding'
522 if text_encoding != 'auto' and not force_pdfdoc:
523 raise ValueError('Invalid text_encoding value: %s'
524 % text_encoding)
526 if source.startswith(cls.bad_pdfdoc_prefix):
527 if force_pdfdoc:
528 raise UnicodeError('Prefix of string %r cannot be encoded '
529 'in pdfdocencoding' % source[:20])
530 else:
531 try:
532 raw = source.encode('pdfdocencoding')
533 except UnicodeError:
534 if force_pdfdoc:
535 raise
536 else:
537 return cls.from_bytes(raw, bytes_encoding)
539 # If the user is not forcing literal strings,
540 # it makes much more sense to use hexadecimal with 2-byte chars
541 raw = cls.bytes_bom + source.encode('utf-16-be')
542 encoding = 'hex' if bytes_encoding == 'auto' else bytes_encoding
543 return cls.from_bytes(raw, encoding)
545 @classmethod
546 def encode(cls, source, uni_type = type(u''), isinstance=isinstance):
547 """ The encode() constructor is a legacy function that is
548 also a convenience for the PdfWriter.
549 """
550 if isinstance(source, uni_type):
551 return cls.from_unicode(source)
552 else:
553 return cls.from_bytes(source)