Coverage for pdfrw/pdfrw/objects/pdfstring.py: 96%

116 statements  

# A part of pdfrw (https://github.com/pmaupin/pdfrw)
# Copyright (C) 2006-2017 Patrick Maupin, Austin, Texas
# 2016 James Laird-Wah, Sydney, Australia
# MIT license -- See LICENSE.txt for details

"""

================================
PdfString encoding and decoding
================================

Introduction
=============


This module handles encoding and decoding of PDF strings. PDF strings
are described in the PDF 1.7 reference manual, mostly in chapter 3
(sections 3.2 and 3.8) and chapter 5.

PDF strings are used in the document structure itself, and also inside
the stream of page contents dictionaries.

A PDF string can represent pure binary data (e.g. for a font or an
image), or text, or glyph indices. For Western fonts, the glyph indices
usually correspond to ASCII, but that is not guaranteed. (When it does
happen, it makes examination of raw PDF data a lot easier.)

The specification defines PDF string encoding at two different levels.
At the bottom, it defines ways to encode arbitrary bytes so that a PDF
tokenizer can understand they are a string of some sort, and can figure
out where the string begins and ends. (That is all the tokenizer itself
cares about.) Above that level, if the string represents text, the
specification defines ways to encode Unicode text into raw bytes, before
the byte encoding is performed.

There are two ways to do the byte encoding, and two ways to do the text
(Unicode) encoding.

Encoding bytes into PDF strings
================================

Adobe calls the two ways to encode bytes into strings "Literal strings"
and "Hexadecimal strings."

Literal strings
------------------

A literal string is delimited by ASCII parentheses ("(" and ")"), and a
hexadecimal string is delimited by ASCII less-than and greater-than
signs ("<" and ">").

A literal string may encode bytes almost unmolested. The caveat is
that if a byte has the same value as a parenthesis, it must be escaped
so that the tokenizer knows the string is not finished. This is accomplished
by using the ASCII backslash ("\") as an escape character. Of course,
now any backslash appearing in the data must likewise be escaped.
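
The escaping rule above can be sketched in a few lines. This is only an
illustration, not the code this module uses: escape_literal is a
hypothetical helper name, and it assumes Python 3 bytes input.

```python
def escape_literal(raw):
    # Hypothetical helper, not part of pdfrw: wrap raw bytes in
    # parentheses and put a backslash in front of every byte that the
    # tokenizer would otherwise treat as special.
    out = bytearray(b'(')
    for byte in raw:
        if byte in b'()\\':
            out += b'\\'
        out.append(byte)
    out += b')'
    return bytes(out)


print(escape_literal(b'a(b)c'))
```

Every other byte value passes through unchanged, which is why literal
strings stay readable for mostly-ASCII data.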

Hexadecimal strings
---------------------

A hexadecimal string requires twice as much space as the source data
it represents (plus two bytes for the delimiter), simply storing each
byte as two hexadecimal digits, most significant digit first. The spec
allows for lower or upper case hex digits, but most PDF encoders seem
to use upper case.
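
As a rough sketch of that size relationship (encode_hex is a hypothetical
name chosen for this example, and Python 3 bytes are assumed):

```python
import binascii


def encode_hex(raw):
    # Hypothetical helper: two upper-case hex digits per source byte,
    # most significant digit first, wrapped in angle brackets.
    return b'<' + binascii.hexlify(raw).upper() + b'>'


encoded = encode_hex(b'Hi!')
# Twice the source length, plus two bytes for the delimiters.
assert len(encoded) == 2 * len(b'Hi!') + 2
print(encoded)
```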

Special cases -- Legacy systems and readability
-----------------------------------------------

It is possible to create a PDF document that uses 7 bit ASCII encoding,
and it is desirable in many cases to create PDFs that are reasonably
readable when opened in a text editor. For these reasons, the syntax
for both literal strings and hexadecimal strings is slightly more
complicated than the initial description above. In general, the additional
syntax allows the following features:

  - Making the delineation between characters, or between sections of
    a string, apparent, and easy to see in an editor.
  - Keeping output lines from getting too wide for some editors.
  - Keeping output lines from being so narrow that you can only see a
    small fraction of a string at a time in an editor.
  - Suppressing unprintable characters.
  - Restricting the output string to 7 bit ASCII.

Hexadecimal readability
~~~~~~~~~~~~~~~~~~~~~~~

For hexadecimal strings, only the first two bullets are relevant. The syntax
to accomplish this is simple, allowing any ASCII whitespace to be inserted
anywhere in the encoded hex string.
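
A minimal sketch of decoding under that rule follows. The function name
is hypothetical and the input is assumed to be Python 3 bytes. The
padding of an odd digit count with a trailing zero follows the PDF
reference's description of hexadecimal strings.

```python
import binascii


def decode_hex_string(encoded):
    # Drop the angle brackets, then discard every run of whitespace.
    digits = b''.join(encoded[1:-1].split())
    # A missing final digit is treated as 0.
    if len(digits) % 2:
        digits += b'0'
    return binascii.unhexlify(digits)


print(decode_hex_string(b'<48 69\n21>'))
```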

Literal readability
~~~~~~~~~~~~~~~~~~~

For literal strings, all of the bullets except the first are relevant.
The syntax has two methods to help with these goals. The first method
is to overload the escape operator to be able to do different functions,
and the second method can reduce the number of escapes required for
parentheses in the normal case.

The escape function works differently, depending on what byte follows
the backslash. In all cases, the escaping backslash is discarded,
and then the next character is examined:

  - For parentheses and backslashes (and, in fact, for all characters
    not described otherwise in this list), the character after the
    backslash is preserved in the output.
  - A letter from the set of "nrtbf" following a backslash is interpreted as
    a line feed, carriage return, tab, backspace, or form-feed, respectively.
  - One to three octal digits following the backslash indicate the
    numeric value of the encoded byte.
  - A carriage return, carriage return/line feed, or line feed following
    the backslash indicates a line break that was put in for readability,
    and that is not part of the actual data, so this is discarded.
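
The four rules in the list above can be sketched with a single regular
expression substitution. This is an illustration only, written for
Python 3 bytes, and all of the names are hypothetical. Masking an octal
value to eight bits is an assumption made here so that the sketch never
fails on oversized values.

```python
import re

# A backslash followed by 1-3 octal digits, a CR/LF pair, or any one byte.
ESCAPE = re.compile(rb'\\([0-7]{1,3}|\r\n|.)', re.DOTALL)
SIMPLE = {b'n': b'\n', b'r': b'\r', b't': b'\t', b'b': b'\b', b'f': b'\f'}
LINE_BREAKS = (b'\r\n', b'\r', b'\n')
OCTAL_DIGITS = b'01234567'


def unescape(match):
    body = match.group(1)
    if body in LINE_BREAKS:
        return b''                      # line continuation: discarded
    if body in SIMPLE:
        return SIMPLE[body]             # one of the "nrtbf" escapes
    if all(digit in OCTAL_DIGITS for digit in body):
        # Octal escape; the mask to 8 bits is an assumption of this sketch.
        return bytes([int(body, 8) & 0xFF])
    return body                         # anything else is kept as-is


def decode_literal_string(encoded):
    # Strip the outer parentheses, then resolve every escape.
    return ESCAPE.sub(unescape, encoded[1:-1])


print(decode_literal_string(b'(\\101 \\(B\\))'))
```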

The second method that can be used to improve readability (and reduce space)
in literal strings is to not escape parentheses. This only works, and is
only allowed, when the parentheses are properly balanced. For example,
"((Hello))" is a valid encoding for a literal string, but "((Hello)" is not;
the latter case should be encoded "(\(Hello)".
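
An encoder that wants to leave parentheses unescaped first has to check
that the data is balanced. A small sketch of such a check (the function
name is hypothetical, and the input is the raw data without its outer
delimiters, as Python 3 bytes):

```python
def parens_balanced(raw):
    # Track nesting depth; a closing parenthesis with nothing open, or
    # anything left open at the end, means the data is not balanced.
    depth = 0
    for byte in raw:
        if byte == ord('('):
            depth += 1
        elif byte == ord(')'):
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(parens_balanced(b'(Hello)'), parens_balanced(b'(Hello'))
```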

Encoding text into strings
==========================

Section 3.8.1 of the PDF specification describes text strings.

The individual characters of a text string can all be considered to
be Unicode; Adobe specifies two different ways to encode these characters
into a string of bytes before further encoding the byte string as a
literal string or a hexadecimal string.

The first way to encode these strings is called PDFDocEncoding. This
is mostly a one-for-one mapping of characters into single bytes, similar to
Latin-1. The representable character set is limited to the number of
characters that can fit in a byte, and this encoding cannot be used
with Unicode strings that start with the two characters making up the
UTF-16-BE BOM.

The second way to encode these strings is with UTF-16-BE. Text strings
encoded with this method must start with the BOM, and although the spec
does not appear to mandate that the resultant bytes be encoded into a
hexadecimal string, that seems to be the canonical way to do it.

When encoding a string into UTF-16-BE, this module always adds the BOM,
and when decoding a string from UTF-16-BE, this module always strips
the BOM. If a source string contains a BOM, that will remain in the
final string after a round-trip through the encoder and decoder, as
the goal of the encoding/decoding process is transparency.
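
The BOM handling described above can be sketched as follows. The function
names are hypothetical, and Latin-1 is used as a stand-in for
PDFDocEncoding, which is an assumption of this example: the two
encodings agree for most printable characters but not for all byte
values.

```python
import codecs

BOM = codecs.BOM_UTF16_BE


def text_to_bytes(text):
    # Always add exactly one BOM in front of the UTF-16-BE data.
    return BOM + text.encode('utf-16-be')


def bytes_to_text(raw):
    # A leading BOM selects UTF-16-BE and is stripped exactly once.
    if raw[:2] == BOM:
        return raw[2:].decode('utf-16-be')
    # Stand-in for PDFDocEncoding (assumption of this sketch).
    return raw.decode('latin-1')


# A BOM that is part of the source text survives the round trip.
source = '\ufeffprice: \u20ac5'
assert bytes_to_text(text_to_bytes(source)) == source
```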

PDF string handling in pdfrw
=============================

Responsibility for handling PDF strings in the pdfrw library is shared
between this module, the tokenizer, and the pdfwriter.

tokenizer string handling
--------------------------

As far as the tokenizer and its clients such as the pdfreader are concerned,
the PdfString class must simply be something that it can instantiate by
passing a string, that doesn't compare equal (or throw an exception when
compared) to other possible token strings. The tokenizer must understand
enough about the syntax of the string to successfully find its beginning
and end in a stream of tokens, but doesn't otherwise know or care about
the data represented by the string.

pdfwriter string handling
--------------------------

The pdfwriter knows and cares about two attributes of PdfString instances:

  - First, PdfString objects have an 'indirect' attribute, which pdfwriter
    uses as an indication that the object knows how to represent itself
    correctly when output to a new PDF. (In the case of a PdfString object,
    no work is really required, because it is already a string.)
  - Second, the PdfString.encode() method is used as a convenience to
    automatically convert any user-supplied strings (that didn't come
    from PDFs) when a PDF is written out to a file.

pdfstring handling
-------------------

The code in this module is designed to support those uses by the
tokenizer and the pdfwriter, and to additionally support encoding
and decoding of PdfString objects as a convenience for the user.

Most users of the pdfrw library never encode or decode a PdfString,
so it is imperative that (a) merely importing this module does not
take a significant amount of CPU time; and (b) it is cheap for the
tokenizer to produce a PdfString, and cheap for the pdfwriter to
consume a PdfString -- if the tokenizer finds a string that conforms
to the PDF specification, it will be wrapped in a PdfString object,
and if the pdfwriter finds an object with an indirect attribute, it
simply calls str() to ask it to format itself.

Encoding and decoding are not actually performed very often at all,
compared to how often tokenization and then subsequent concatenation
by the pdfwriter are performed. In fact, versions of pdfrw prior to
0.4 did not even support Unicode for this function. Encoding and
decoding can also easily be performed by the user, outside of the
library, and this might still be recommended, at least for encoding,
if the visual appeal of encodings generated by this module is found
lacking.


Decoding strings
~~~~~~~~~~~~~~~~~~~

Decoding strings can be tricky, but is a bounded process. Each
properly-encoded string represents exactly one output string,
with the caveat that it is up to the caller of the function to know
whether a Unicode string or just bytes is expected.

The caller can call PdfString.to_bytes() to get a byte string (which may
or may not represent encoded Unicode), or may call PdfString.to_unicode()
to get a Unicode string. Byte strings will be regular strings in Python 2,
and b'' bytes in Python 3; Unicode strings will be regular strings in
Python 3, and u'' unicode strings in Python 2.

To maintain application compatibility with earlier versions of pdfrw,
PdfString.decode() is an alias for PdfString.to_unicode().

Encoding strings
~~~~~~~~~~~~~~~~~~

PdfString has three factory functions that will encode strings into
PdfString objects:

  - PdfString.from_bytes() accepts a byte string (regular string in Python 2
    or b'' bytes string in Python 3) and returns a PdfString object.
  - PdfString.from_unicode() accepts a Unicode string (u'' Unicode string in
    Python 2 or regular string in Python 3) and returns a PdfString object.
  - PdfString.encode() examines the type of object passed, and either
    calls from_bytes() or from_unicode() to do the real work.

Unlike decoding, encoding is not (mathematically) a function.
There are (literally) an infinite number of ways to encode any given
source string. (Of course, most of them would be stupid, unless
the intent is some sort of denial-of-service attack.)

So encoding strings is either simpler than decoding, or can be made to
be an open-ended science fair project (to create the best looking
encoded strings).

There are parameters to the encoding functions that allow control over
the final encoded string, but the intention is to make the default values
produce a reasonable encoding.
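
As an illustration of how such a parameter can steer the result, the
sketch below is loosely modelled on the from_bytes() method defined later
in this module. It is a standalone approximation for Python 3 bytes, not
the method itself, and encode_bytes is a hypothetical name. With 'auto',
it falls back to hexadecimal only when every byte would need an escape.

```python
import binascii
import re

# Splitting on a capturing group keeps the special bytes at odd indices.
SPLIT_SPECIALS = re.compile(rb'(\(|\\|\))').split


def encode_bytes(raw, bytes_encoding='auto'):
    if bytes_encoding not in ('auto', 'literal', 'hex'):
        raise ValueError('Invalid bytes_encoding value: %s' % bytes_encoding)
    pieces = SPLIT_SPECIALS(raw)
    escapes = len(pieces) // 2          # number of bytes needing a backslash
    use_hex = bytes_encoding == 'hex' or (
        bytes_encoding == 'auto' and escapes >= len(raw))
    if use_hex:
        return b'<' + binascii.hexlify(raw).upper() + b'>'
    pieces[1::2] = [b'\\' + special for special in pieces[1::2]]
    return b'(' + b''.join(pieces) + b')'


print(encode_bytes(b'Hello (world)'))
print(encode_bytes(b'()'))
print(encode_bytes(b'()', 'literal'))
```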

As mentioned previously, if encoding does not do what a particular
user needs, that user is free to write his own encoder, and then
simply instantiate a PdfString object by passing a string to the
default constructor, the same way that the tokenizer does it.

However, if desirable, encoding may gradually become more capable
over time, adding the ability to generate more aesthetically pleasing
encoded strings.

PDFDocString encoding and decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To handle this encoding in a fairly standard way, this module registers
an encoder and decoder for PDFDocEncoding with the codecs module.
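
The registration mechanism can be sketched with a deliberately tiny
codec. The codec name 'demobullet' and its one-entry table are invented
for this example; the real PDFDocEncoding table is much larger.

```python
import codecs


def find_demo_codec(name):
    # Search functions return None for names they do not handle.
    if name != 'demobullet':
        return None

    # Identity mapping for ASCII, plus byte 0x80 -> U+2022 (bullet).
    decoding_map = dict((x, x) for x in range(0x80))
    decoding_map[0x80] = 0x2022
    encoding_map = codecs.make_encoding_map(decoding_map)

    def encode(text, errors='strict'):
        return codecs.charmap_encode(text, errors, encoding_map)

    def decode(data, errors='strict'):
        return codecs.charmap_decode(data, errors, decoding_map)

    return codecs.CodecInfo(encode, decode, name='demobullet')


codecs.register(find_demo_codec)

print('\u2022 item'.encode('demobullet'))
```

Because the tables are built inside the search function, nothing is
computed until the first time the codec is looked up.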

"""

import re
import codecs
import binascii
import itertools
from ..py23_diffs import convert_load, convert_store


def find_pdfdocencoding(encoding):
    """ This function conforms to the codec module registration
        protocol. It defers calculating data structures until
        a pdfdocencoding encode or decode is required.

        PDFDocEncoding is described in the PDF 1.7 reference manual.
    """

    if encoding != 'pdfdocencoding':
        return

    # Create the decoding map based on the table in section D.2 of the
    # PDF 1.7 manual

    # Start off with the characters with 1:1 correspondence
    decoding_map = set(range(0x20, 0x7F)) | set(range(0xA1, 0x100))
    decoding_map.update((0x09, 0x0A, 0x0D))
    decoding_map.remove(0xAD)
    decoding_map = dict((x, x) for x in decoding_map)

    # Add in the special Unicode characters
    decoding_map.update(zip(range(0x18, 0x20), (
        0x02D8, 0x02C7, 0x02C6, 0x02D9, 0x02DD, 0x02DB, 0x02DA, 0x02DC)))
    decoding_map.update(zip(range(0x80, 0x9F), (
        0x2022, 0x2020, 0x2021, 0x2026, 0x2014, 0x2013, 0x0192, 0x2044,
        0x2039, 0x203A, 0x2212, 0x2030, 0x201E, 0x201C, 0x201D, 0x2018,
        0x2019, 0x201A, 0x2122, 0xFB01, 0xFB02, 0x0141, 0x0152, 0x0160,
        0x0178, 0x017D, 0x0131, 0x0142, 0x0153, 0x0161, 0x017E)))
    decoding_map[0xA0] = 0x20AC

    # Make the encoding map from the decoding map
    encoding_map = codecs.make_encoding_map(decoding_map)

    # Not every PDF producer follows the spec, so conform to Postel's law
    # and interpret encoded strings if at all possible. In particular, they
    # might have nulls and form-feeds, judging by random code snippets
    # floating around the internet.
    decoding_map.update(((x, x) for x in range(0x18)))

    def encode(input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)

    return codecs.CodecInfo(encode, decode, name='pdfdocencoding')


codecs.register(find_pdfdocencoding)


class PdfString(str):
    """ A PdfString is an encoded string. It has a decode
        method to get the actual string data out, and there
        is an encode class method to create such a string.
        Like any PDF object, it could be indirect, but it
        defaults to being a direct object.
    """
    indirect = False

    # The byte order mark, and unicode that could be
    # wrongly encoded into the byte order mark by the
    # pdfdocencoding codec.

    bytes_bom = codecs.BOM_UTF16_BE
    bad_pdfdoc_prefix = bytes_bom.decode('latin-1')

    # Used by decode_literal; filled in on first use

    unescape_dict = None
    unescape_func = None

    @classmethod
    def init_unescapes(cls):
        """ Sets up the unescape attributes for decode_literal
        """
        unescape_pattern = r'\\([0-7]{1,3}|\r\n|.)'
        unescape_func = re.compile(unescape_pattern, re.DOTALL).split
        cls.unescape_func = unescape_func

        unescape_dict = dict(((chr(x), chr(x)) for x in range(0x100)))
        unescape_dict.update(zip('nrtbf', '\n\r\t\b\f'))
        unescape_dict['\r'] = ''
        unescape_dict['\n'] = ''
        unescape_dict['\r\n'] = ''
        for i in range(0o10):
            unescape_dict['%01o' % i] = chr(i)
        for i in range(0o100):
            unescape_dict['%02o' % i] = chr(i)
        for i in range(0o400):
            unescape_dict['%03o' % i] = chr(i)
        cls.unescape_dict = unescape_dict
        return unescape_func

    def decode_literal(self):
        """ Decode a PDF literal string, which is enclosed in parentheses ()

            Many pdfrw users never decode strings, so defer creating
            data structures to do so until the first string is decoded.

            Possible string escapes from the spec:
            (PDF 1.7 Reference, section 3.2.3, page 53)

                1. \[nrtbf\()]: simple escapes
                2. \\d{1,3}: octal. Must be zero-padded to 3 digits
                   if followed by digit
                3. \<end of line>: line continuation. We don't know the EOL
                   marker used in the PDF, so accept \r, \n, and \r\n.
                4. Any other character following \ escape -- the backslash
                   is swallowed.
        """
        result = (self.unescape_func or self.init_unescapes())(self[1:-1])
        if len(result) == 1:
            return convert_store(result[0])
        unescape_dict = self.unescape_dict
        result[1::2] = [unescape_dict[x] for x in result[1::2]]
        return convert_store(''.join(result))

    def decode_hex(self):
        """ Decode a PDF hexadecimal-encoded string, which is enclosed
            in angle brackets <>.
        """
        hexstr = convert_store(''.join(self[1:-1].split()))
        # An odd number of digits indicates a truncated trailing 0.
        # (The test must be modulo 2; "% 1" is always zero.)
        if len(hexstr) % 2:
            # convert_store() yields bytes on Python 3, so pad with bytes.
            hexstr += b'0'
        return binascii.unhexlify(hexstr)

    def to_bytes(self):
        """ Decode a PDF string to bytes. This is a convenience function
            for user code, in that (as of pdfrw 0.3) it is never
            actually used inside pdfrw.
        """
        if self.startswith('(') and self.endswith(')'):
            return self.decode_literal()

        elif self.startswith('<') and self.endswith('>'):
            return self.decode_hex()

        else:
            raise ValueError('Invalid PDF string "%s"' % repr(self))

    def to_unicode(self):
        """ Decode a PDF string to a unicode string. This is a
            convenience function for user code, in that (as of
            pdfrw 0.3) it is never actually used inside pdfrw.

            There are two Unicode storage methods used -- either
            UTF16_BE, or something called PDFDocEncoding, which
            is defined in the PDF spec. The determination of
            which decoding method to use is done by examining the
            first two bytes for the byte order marker.
        """
        raw = self.to_bytes()

        if raw[:2] == self.bytes_bom:
            return raw[2:].decode('utf-16-be')
        else:
            return raw.decode('pdfdocencoding')

    # Legacy-compatible interface
    decode = to_unicode

    # Internal value used by encoding

    escape_splitter = None  # Calculated on first use

    @classmethod
    def init_escapes(cls):
        """ Initialize the escape_splitter for the encode method
        """
        cls.escape_splitter = re.compile(br'(\(|\\|\))').split
        return cls.escape_splitter

    @classmethod
    def from_bytes(cls, raw, bytes_encoding='auto'):
        """ The from_bytes() constructor is called to encode a source raw
            byte string into a PdfString that is suitable for inclusion
            in a PDF.

            NOTE: There is no magic in the encoding process. A user
            can certainly do his own encoding, and simply initialize a
            PdfString() instance with his encoded string. That may be
            useful, for example, to add line breaks to make it easier
            to load PDFs into editors, or to not bother to escape balanced
            parentheses, or to escape additional characters to make a PDF
            more readable in a file editor. Those are features not
            currently supported by this method.

            from_bytes() can use a heuristic to figure out the best
            encoding for the string, or the user can control the process
            by changing the bytes_encoding parameter to 'literal' or 'hex'
            to force a particular conversion method.
        """

        # If hexadecimal is not being forced, then figure out how long
        # the escaped literal string will be, and fall back to hex if
        # it is too long.

        force_hex = bytes_encoding == 'hex'
        if not force_hex:
            if bytes_encoding not in ('literal', 'auto'):
                raise ValueError('Invalid bytes_encoding value: %s'
                                 % bytes_encoding)
            splitlist = (cls.escape_splitter or cls.init_escapes())(raw)
            if bytes_encoding == 'auto' and len(splitlist) // 2 >= len(raw):
                force_hex = True

        if force_hex:
            # The spec does not mandate uppercase,
            # but it seems to be the convention.
            fmt = '<%s>'
            result = binascii.hexlify(raw).upper()
        else:
            fmt = '(%s)'
            splitlist[1::2] = [(b'\\' + x) for x in splitlist[1::2]]
            result = b''.join(splitlist)

        return cls(fmt % convert_load(result))

    @classmethod
    def from_unicode(cls, source, text_encoding='auto',
                     bytes_encoding='auto'):
        """ The from_unicode() constructor is called to encode a source
            string into a PdfString that is suitable for inclusion in a PDF.

            NOTE: There is no magic in the encoding process. A user
            can certainly do his own encoding, and simply initialize a
            PdfString() instance with his encoded string. That may be
            useful, for example, to add line breaks to make it easier
            to load PDFs into editors, or to not bother to escape balanced
            parentheses, or to escape additional characters to make a PDF
            more readable in a file editor. Those are features not
            supported by this method.

            from_unicode() can use a heuristic to figure out the best
            encoding for the string, or the user can control the process
            by changing the text_encoding parameter to 'pdfdocencoding'
            or 'utf16', and/or by changing the bytes_encoding parameter
            to 'literal' or 'hex' to force particular conversion methods.

            The function will raise an exception if it cannot perform
            the conversion as requested by the user.
        """

        # Give preference to pdfdocencoding, since it only
        # requires one raw byte per character, rather than two.
        if text_encoding != 'utf16':
            force_pdfdoc = text_encoding == 'pdfdocencoding'
            if text_encoding != 'auto' and not force_pdfdoc:
                raise ValueError('Invalid text_encoding value: %s'
                                 % text_encoding)

            if source.startswith(cls.bad_pdfdoc_prefix):
                if force_pdfdoc:
                    raise UnicodeError('Prefix of string %r cannot be encoded '
                                       'in pdfdocencoding' % source[:20])
            else:
                try:
                    raw = source.encode('pdfdocencoding')
                except UnicodeError:
                    if force_pdfdoc:
                        raise
                else:
                    return cls.from_bytes(raw, bytes_encoding)

        # If the user is not forcing literal strings,
        # it makes much more sense to use hexadecimal with 2-byte chars
        raw = cls.bytes_bom + source.encode('utf-16-be')
        encoding = 'hex' if bytes_encoding == 'auto' else bytes_encoding
        return cls.from_bytes(raw, encoding)

    @classmethod
    def encode(cls, source, uni_type=type(u''), isinstance=isinstance):
        """ The encode() constructor is a legacy function that is
            also a convenience for the PdfWriter.
        """
        if isinstance(source, uni_type):
            return cls.from_unicode(source)
        else:
            return cls.from_bytes(source)