Montag, 13. Februar 2017

Where have all the standards gone? A singalong for archivists.

Recently, we noticed that the specification for the TIFF 6 file format has vanished from Adobe's website, where it was last hosted. As you might know, Adobe owns the TIFF 6 specification as a result of its acquisition of Aldus in 1994.

Until now, we relied on the fact that TIFF is publicly specified in a document that was always available. However, since Adobe has taken the document down, all we have left are the local copies on our workstations, and we only have those out of pure luck. The link to http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf has been dead for several months now.

This made us think about the standards and specifications themselves. We've always said, half jokingly, that we would have to preserve the standard documents in our repositories as well if we wanted to do our jobs right. We also thought that this would never actually be necessary. Boy, were we wrong.

We're now gathering all the standard and specification documents for the file formats that we are using and that we are planning to use. These documents will then be ingested into the repository using separate workflows to keep them apart from the actual repository content. That way, we hope to have all documents at hand even if they vanish from the web.

From our new perspective, we urge all digital repositories to take care not only of their digital assets, but also of the standard and specification documents they rely on.

The TIFF user community took a major hit just recently when the domain owners of http://www.remotesensing.org/libtiff/ lost control of their domain, making libtiff and the infrastructure around it unavailable for several weeks. Even though libtiff is now available again at its new home (http://libtiff.maptools.org), we need to be aware that even widely available material might become unavailable from one day to the next.


Freitag, 20. Januar 2017

repairing TIFF images - a preliminary report

During two years of operation, more than 3,000 ingests have been piling up in the Technical Analyst's workbench of our digital preservation software. The vast majority of them have been singled out by the format validation routines, indicating a problem with the standard compliance of these files. One can easily see that repairing these files is a lot of work that, because the repository software doesn't support batch operations for TIFF repairs, would require months of repetitive tasks. Being IT personnel, we did the only sane thing we could think of: let the computer take care of it. We extracted the files from our repository's working directory, copied them to a safe storage area and ran an automated repair routine on those files. In this article, we want to go into some detail about how much of an effort repairing a large corpus of inhomogeneously invalid TIFFs actually is, which errors we encountered and which tools we used to repair them.

So, let's first see how big our problem actually is. The Technical Analyst's workbench contains 3,101 submission information packages (SIPs), each of them containing exactly one Intellectual Entity (IE). These SIPs contain 107,218 TIFF files, adding up to a grand total of about 1.95 TB of storage. That's an average of 19.08 MB per TIFF image.

While the repository software does list error messages for invalid files in the WebUI, they cannot be extracted automatically, making them useless for our endeavour. Moreover, our preservation repo uses JHove's TIFF-hul module for TIFF validation, which cannot be modified to accommodate local validation policies. We use a policy that is largely based on Baseline TIFF, including a few extensions. To validate TIFFs against this policy (or any other policy that you can think of, for that matter), my colleague Andreas has created the tool checkit_tiff, which is freely (free as in free speech AND free beer) available on GitHub for anyone to use. We used this tool to validate our TIFF files and single out those that didn't comply with our policy. (If you are interested, we used the policy as configured in the config file cit_tiff6_baseline_SLUB.cfg, which covers the conditions described in the German document http://www.slub-dresden.de/ueber-uns/slubarchiv/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten/langzeitarchivfaehige-dateiformate/handreichung-tiff/ as published on 2016-06-08.)

For the correction operations, we used the tool fixit_tiff (also created by Andreas and freely available), the tools tiffset and tiffcp from the libtiff suite and convert from ImageMagick. All of the operations ran on a virtual machine with 2x 2.2 GHz CPUs and 3 GB RAM with a recent and fairly minimal Debian 8 installation. The storage was mounted via NFS 3 from a NetApp enterprise NAS system and connected via 10 GBit Ethernet. Nevertheless, we only got around 35 MB/s throughput during copy operations (and, presumably, also during repair operations), which we'll have to investigate further in the future.

The high-level algorithm for the complete repair task was as follows:
  1. copy all of the master data from the digital repository to a safe storage for backup
  2. duplicate that backup data to a working directory to run the actual validation/repair in
  3. split the whole corpus into smaller chunks of 500 SIPs to keep processing times low and be able to react if something goes wrong
  4. run repair script, looping through all TIFFs in the chunk
    1. validate a tiff using checkit_tiff
    2. if TIFF is valid, go to next TIFF (step 4), else continue (try to repair TIFF)
    3. parse validation output to find necessary repair steps
    4. run necessary repair operations
    5. validate the corrected tiff using checkit_tiff to detect errors that haven't been corrected
    6. recalculate the checksums for the corrected files and replace the old checksums in the metadata with the new ones
  5. write report to log file
  6. parse through report log to identify unsolved problems, create repair recipes for those and/or enhance fixit_tiff
  7. restore unrepaired TIFFs from backup, rerun repair script
  8. repeat steps 4-7 until only those files are left that cannot be repaired in an automated workflow
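To make the inner loop (step 4) a bit more concrete, here is a minimal sketch of such a driver in Python. It assumes that checkit_tiff is invoked as shown later in this blog (binary, TIFF file, config file) and that policy violations are printed on lines starting with "==>", as in the sample output further below; the chunk path is made up, and the actual repair recipes (fixit_tiff, tiffset, tiffcp, convert) are only hinted at.

#!/usr/bin/env python3
"""Minimal sketch of the per-chunk validate/repair loop (steps 4.1 - 4.6)."""
import hashlib
import pathlib
import subprocess

CHECKIT = "./checkit_tiff"                        # compliance checker described above
POLICY = "cit_tiff6_baseline_SLUB.cfg"            # policy config named in this post
CHUNK = pathlib.Path("/work/chunk_0001")          # made-up working directory for one chunk

def violations(tiff):
    """Steps 4.1/4.5: run checkit_tiff and collect the reported policy violations."""
    proc = subprocess.run([CHECKIT, str(tiff), POLICY], capture_output=True, text=True)
    # assumption: violations are printed on lines starting with "==>",
    # as in the sample checkit_tiff output further down in this blog
    return [line for line in proc.stdout.splitlines() if line.startswith("==>")]

def md5sum(tiff):
    """Step 4.6: recalculate the checksum of a possibly corrected file."""
    digest = hashlib.md5()
    with open(tiff, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

for tiff in sorted(CHUNK.rglob("*.tif")):
    errors = violations(tiff)
    if not errors:
        continue                                  # step 4.2: valid, go to the next TIFF
    # steps 4.3/4.4: map each error message to a repair recipe
    # (fixit_tiff, tiffset, tiffcp, convert); elided in this sketch
    remaining = violations(tiff)                  # step 4.5: revalidate the corrected file
    print(tiff, len(errors), "errors found,", len(remaining), "left, md5:", md5sum(tiff))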
During the several iterations of validation, failed correction and enhancements of the repair recipes, we found the following correctable errors. Brace yourself, it's a long list; feel free to scroll past it for more condensed information, and note that a small repair sketch follows the list.
  • "baseline TIFF should have only one IFD, but IFD0 at 0x00000008 has pointer to IFDn 0x<HEX_ADDRESS>"
    • This is a multipage TIFF with a second Image File Directory (IFD). Baseline TIFF readers are only required to interpret the first IFD.
  • "Invalid TIFF directory; tags are not sorted in ascending order"
    • This is a violation of the TIFF6 specification, which requires that TIFF tags in an IFD must be sorted ascending by their respective tag number.
  • "tag 256 (ImageWidth) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag is required by the baseline TIFF specification, but wasn't found in the file.
  • "tag 257 (ImageLength) should have value , but has value (values or count) was not found, but requested because defined"
    • Same here.
  • "tag 259 (Compression) should have value 1, but has value X"
    • This is a violation of our internal policy, which requires that TIFFs must be stored without any compression in place. Values for X that were found are 4, 5 and 7, which are CCITT T.6 bi-level encoding, LZW compression and TIFF/EP JPEG baseline DCT-based lossy compression, respectively. The latter one would be a violation of the TIFF6 specification. However, we've noticed that a few files in our corpus were actually TIFF/EPs, where Compression=7 is a valid value.
  • "tag 262 (Photometric) should have value <0-2>, but has value (values or count) 3"
    • The pixels in this TIFF are color map encoded. While this is valid TIFF 6, we don't allow it in the context of digital preservation.
  • "tag 262 (Photometric) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 269 (DocumentName) should have value ^[[:print:]]*$, but has value (values or count) XXXXX"
    • The field is of ASCII type, but contains characters that are not from the 7-Bit ASCII range. Often, these are special characters that are specific to a country/region, like the German "ä, ö, ü, ß".
  • "tag 270 (ImageDescription) should have value word-aligned, but has value (values or count) pointing to 0x00000131 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Make tag is empty, even though the specification requires it to contain the manufacturer's name as a string.
  • "tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count) Mekel"
    • That's a special case where scanners from the manufacturer Mekel write multiple NULL bytes ("\0") at the end of the Make tag, presumably for padding. This, however, violates the TIFF6 specification.
  • "tag 272 (Model) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Model tag is empty, even though the specification requires it to contain the scanner device's name as a string.
  • "tag 273 (StripOffsets) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 278 (RowsPerStrip) should have value , but has value (values or count) was not found, but requested because defined"
    • Same here.
  • "tag 278 (RowsPerStrip) should have value , but has value (values or count) with incorrect type: unknown type (-1)"
    • This error results from the previous one: if a field doesn't exist, checkit_tiff assumes data type "-1", which is not a valid type in the real world.
  • "tag 278 (RowsPerStrip) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 279 (StripByteCounts) should have value , but has value (values or count) was not found, but requested because defined"
    • The field doesn't contain a value, which violates the TIFF6 specification.
  • "tag 282 (XResolution) should have value word-aligned, but has value (values or count) pointing to 0x00000129 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 292 (Group3Options) is found, but is not whitelisted"
    • As compression is not allowed in our repository, we disallow this field that comes with certain compression types as well.
  • "tag 293 (Group4Options) is found, but is not whitelisted"
    • Same here.
  • "tag 296 (ResolutionUnit) should have value , but has value"
    • The tag ResolutionUnit is a required field and is set to "2" (inch) by default. However, if the field is completely missing (as was the case here), this is a violation of the TIFF6 specification.
  • "tag 296 (ResolutionUnit) should have value , but has value (values or count) with incorrect type: unknown type (-1)"
    • This error results from the previous one: if a field doesn't exist, checkit_tiff assumes data type "-1", which is not a valid type in the real world.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=0"
    • The TIFF6 specification states that: "If PageNumber[1] is 0, the total number of pages in the document is not available.". We don't allow this in our repository by local policy.
  • "tag 306 (DateTime) should have value ^[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] [012][0-9]:[0-5][0-9]:[0-6][0-9]$, but has value (values or count) XXXXX"
    • That's one of the most common errors. It's utterly unbelievable how many software manufacturers don't manage to comply with the very clear rules of how the DateTime string in a TIFF needs to be formatted. This is a violation of the TIFF6 specification.
  • "tag 306 (DateTime) should have value should be  "yyyy:MM:DD hh:mm:ss", but has value (values or count) of datetime was XXXXX"
    • Same here
  • "tag 306 (DateTime) should have value word-aligned, but has value (values or count) pointing to 0x00000167 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 315 (Artist) is found, but is not whitelisted"
    • The tag Artist may contain personal data and is forbidden by local policy.
  • "tag 317 (Predictor) is found, but is not whitelisted"
    • The tag Predictor is needed for encoding schemes that are not part of the Baseline TIFF6 specification, so we forbid it by local policy.
  • "tag 320 (Colormap) is found, but is not whitelisted"
    • TIFFs with this error message contain a color map instead of being encoded as bilevel/greyscale/RGB images. This is something that is forbidden by policy, hence we need to correct it.
  • "tag 339 (SampleFormat) is found, but is not whitelisted"
    • This tag is forbidden by local policy.
  • "tag 33432 (Copyright) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Copyright tag is only allowed to have character values from the 7-Bit ASCII range. TIFFs that violate this rule from the TIFF6 specification will throw this error.
  • "tag 33434 (EXIF ExposureTime) is found, but is not whitelisted"
    • EXIF tags must never be referenced directly from IFD0, but always from their own Exif IFD. As that has probably not happened here, this has to be treated as a violation of the TIFF6 specification.
  • "tag 33437 (EXIF FNumber) is found, but is not whitelisted"
    • Same here.
  • "tag 33723 (RichTIFFIPTC / NAA) is found, but is not whitelisted"
    • This tag is not allowed by local policy.
  • "tag 34665 (EXIFIFDOffset) should have value , but has value"
    • In all cases that we encountered, the tag EXIFIFDOffset was set to the wrong type. Instead of being of type 4, it was of type 13, which violates the TIFF specification.
  • "tag 34377 (Photoshop Image Ressources) is found, but is not whitelisted"
    • This proprietary tag is not allowed by local policy.
  • "tag 34675 (ICC Profile) should have value pointing to valid ICC profile, but has value (values or count) preferred cmmtype ('APPL') should be empty or (possibly, because ICC validation is alpha code) one of following strings: 'ADBE' 'ACMS' 'appl' 'CCMS' 'UCCM' 'UCMS' 'EFI ' 'FF  ' 'EXAC' 'HCMM' 'argl' 'LgoS' 'HDM ' 'lcms' 'KCMS' 'MCML' 'WCS ' 'SIGN' 'RGMS' 'SICC' 'TCMM' '32BT' 'WTG ' 'zc00'"
    • This is a juicy one. This error message indicates that something's wrong with the embedded ICC profile. In fact, the TIFF itself might be completely intact, but the ICC profile has the value of the cmmtype field set to a value that is not part of the controlled vocabulary for this field, so the ICC standard is violated.
  • "tag 34852 (EXIF SpectralSensitivity) is found, but is not whitelisted"
    • EXIF tags must never be referenced directly from IFD0, but always from their own Exif IFD.
  • "tag 34858 (TimeZoneOffset (TIFF/EP)) is found, but is not whitelisted"
    • TIFF/EP tags are not allowed in plain TIFF6 images.
  • "tag 36867 (EXIF DateTimeOriginal) is found, but is not whitelisted"
    • EXIF tags must never be referenced directly from IFD0, but always from their own Exif IFD.
  • "tag 37395 (ImageHistory (TIFF/EP)) is found, but is not whitelisted"
    • Same here.
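Purely as an illustration, two of the more frequent problems from the list (the malformed DateTime in tag 306 and the disallowed compression in tag 259) could be tackled with the stock libtiff tools mentioned above. The file names and the corrected DateTime value in this sketch are made up; the recipes we actually ran were built around fixit_tiff and the tools listed earlier.

"""Sketch of two frequent repairs, using the libtiff tools tiffset and tiffcp."""
import subprocess

broken = "scan_0001.tif"                          # made-up file name

# tag 306 (DateTime): write a date in the "YYYY:MM:DD HH:MM:SS" form required
# by the TIFF6 specification (the value itself is invented here; a real recipe
# would derive it from the broken original)
subprocess.run(["tiffset", "-s", "306", "2016:05:12 10:30:00", broken], check=True)

# tag 259 (Compression): rewrite the image without compression, as our policy
# demands (LZW, CCITT T.6 and JPEG are all disallowed)
subprocess.run(["tiffcp", "-c", "none", broken, "scan_0001_uncompressed.tif"], check=True)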
Some of the errors, however, could not be corrected by means of an automatic workflow. These images will have to be rescanned from their respective originals:
  • "tag 282 (XResolution) should have value <300-4000>, but has value (values or count) 200, 240, 273, 72"
    • This tag contains a value for the image's horizontal resolution that is too low for what is needed to comply with the policy. In this special case, that policy is not our own, but the one stated in the German Research Foundation's (Deutsche Forschungsgemeinschaft, DFG) "Practical Guidelines for Digitisation" (DFG-Praxisregeln "Digitalisierung", document in German, http://www.dfg.de/formulare/12_151/12_151_de.pdf), where a minimum of 300 dpi is required for digital documents that were scanned from an analog master and are intended for close examination. 1,717 files contained this error.
  • "tag 283 (YResolution) should have value <300-4000>, but has value (values or count) 200, 240, 273, 72"
    • Same here, but for the vertical resolution.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=2"
    • This error message indicates that the TIFF has more than one page (in this case two master images), which is forbidden by our internal policy. Five images contained this error.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=3"
    • Same here. One image contained this error.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=5"
    • Same here. One image contained this error.
  • "TIFF Header read error3: Success"
    • This TIFF was actually broken, had a file size of only 8 Bytes and was already defective when it was ingested into the repository. One image contained this error.
Based on these experiences, Andreas created eight new commits for fixit_tiff (commits f51f71d to cf9b824) that made fixit_tiff more capable and more independent of libtiff, which contained quite a few bugs and sometimes even created new problems in corrected TIFFs that didn't exist before. He also improved checkit_tiff to vastly increase performance (3-4 orders of magnitude) and helped build correction recipes.

The results are quite stunning and saved us a lot of work:
  • Only 1,725 out of 107,218 TIFF files could not be corrected and will have to be rescanned. That's about 1.6% of all files. All other files were either correct from the beginning or have been corrected successfully.
  • 26 out of 3,103 SIPs still have incorrect master images in them, which is a ratio of 0.8%.
  • 11 new correction recipes have been created to fix a total of 41 errors (as listed above).
  • The validation of a subset of 6,987 files took us 37m 46s (= 2,266 seconds) on the latest checkit_tiff version, which is a rate of about 3.1 files/sec. At this speed, checking all 107,218 files would theoretically take approximately 9.7 hours. However, this version wasn't available during the whole correction run, so the speed was drastically lower in the beginning. We think that 24-36 hours would be a more accurate estimate.
  • UPDATE: After further improvements in checkit_tiff (commit 22ced80), checking 87,873 TIFFs took only 51m 53s, which is 28.2 TIFFs per second (yes, that's 28.2 Hz!), marking a ninefold improvement over the previous version for this commit alone. With this new version, we can validate TIFFs at a stable speed, independent of their actual file size, meaning that we get TIFF validation practically for free (compared to the effort for things like MD5 calculation).
  • 10,774 out of 107,218 TIFF files were valid from the start, which is pretty much exactly 10%.
The pie chart shows our top ten errors as extracted from all validation runs. The tag IDs are color coded.


This logarithmically scaled graph shows an assembly of all tags that had any errors, regardless of their nature. The X-axis is labelled with the TIFF tag IDs, and the data itself is labeled with the number of error messages for their respective tag IDs.


Up until now, we've invested 26 person days in this matter (not counting script run times, of course); however, we haven't finished yet. A few steps are still missing before the SIPs can actually be transferred to permanent storage. First of all, we will revalidate all of the corrected TIFFs to make sure that we haven't made any mistakes while moving corrected data out of the way and replacing it with yet-to-correct data. When this step has been completed successfully, we'll reject all of the SIPs from the Technical Analyst's workbench in the repository and re-ingest them. We hope that there won't be any errors this time, but we assume that some will come up and brace for the worst. Also, we'll invest some time in generating statistics. We hope that this will enable us to make qualified estimates of the costs of repairing TIFF images, of the number of images affected by a certain type of error and of the overall quality of our production.

A little hint for those of you that want to try this at home: make sure you run the latest checkit_tiff compliance checker with the "-m" option set to enable memory-mapped operation and get drastically increased performance, especially during batch operation.
For the purpose of analysing TIFF files, checkit_tiff comes with a handy "-c" switch that enables colored output, so you can easily spot any errors on the text output.

I want to use the end of this article to say a few words of warning. On the one hand, we have shown that we are capable of successfully repairing large amounts of invalid or non-compliant files in an automated fashion. On the other hand, however, this sets a dangerous precedent for all the people who don't want to make the effort to increase quality as early as possible during production, because they find it easier to have others fix their sloppy quality. Please, dear digital preservation community, always demand only the highest quality from your producers. It's nothing less than your job, and it's for their own good.

Dienstag, 22. November 2016

Some thoughts about risks in TIFF file format

Introduction

TIFF in general is a very simple file format. It starts with a constant header entry, which indicates that the file is a TIFF and how it is encoded (byte order).
The header contains an offset entry which points to the first image file directory (IFD). Each IFD has a field which counts the number of associated tags, followed by an array of these tags and an offset entry that points to the next IFD or is zero, which means there is no further IFD.
Each tag entry in the array is 12 bytes long. The first 2 bytes indicate the tag itself, the next 2 bytes declare the value type, followed by 4 bytes counting the values. The last 4 bytes either hold the values themselves or an offset to them.
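This layout is easy to walk with a few lines of code. A minimal sketch (classic 32-bit TIFF only, no BigTIFF) that dumps the header and the entries of the first IFD:

"""Minimal sketch: dump the header and the first IFD of a classic (non-Big) TIFF."""
import struct
import sys

with open(sys.argv[1], "rb") as f:
    data = f.read()

# header: bytes 0-1 byte order ("II" = little endian, "MM" = big endian),
# bytes 2-3 the magic number 42, bytes 4-7 the offset to IFD0
endian = "<" if data[:2] == b"II" else ">"
magic, ifd0 = struct.unpack(endian + "HI", data[2:8])
print("byte order", data[:2], "magic", magic, "IFD0 at", hex(ifd0))

# IFD: a 2-byte entry count, then 12-byte entries, then a 4-byte offset
# to the next IFD (zero means there is no further IFD)
(count,) = struct.unpack(endian + "H", data[ifd0:ifd0 + 2])
for i in range(count):
    entry = data[ifd0 + 2 + 12 * i : ifd0 + 14 + 12 * i]
    # each entry: 2 bytes tag number, 2 bytes value type, 4 bytes value count,
    # 4 bytes that hold either the values themselves or an offset to them
    tag, typ, n, value_or_offset = struct.unpack(endian + "HHII", entry)
    print("tag", tag, "type", typ, "count", n, "value/offset", hex(value_or_offset))
(next_ifd,) = struct.unpack(endian + "I", data[ifd0 + 2 + 12 * count : ifd0 + 6 + 12 * count])
print("next IFD at", hex(next_ifd) if next_ifd else "0 (no further IFD)")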

What makes a TIFF robust?

In the TIFF specification, there are some hints which help us to repair broken TIFFs.
The first hint is that all offset addresses must be even. The second important rule is that the tags in an IFD must be sorted in ascending order.
Finally, the TIFF spec defines different areas in the tag range. This guarantees that the important values are well defined.
If we can guarantee that a valid TIFF was stored in the first place, there is a good chance of detecting and repairing broken TIFFs using these three hints.
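As a small illustration, the first two hints can be written down as trivial checks over the parsed tag numbers and offset addresses (the demo values below are made up):

"""Sketch of the first two hints: word-aligned offsets and ascending tag order."""

def odd_offsets(offsets):
    """First hint: every offset used in the file has to point to an even address."""
    return [off for off in offsets if off % 2 != 0]

def tags_ascending(tags):
    """Second hint: tag numbers within an IFD have to be strictly ascending."""
    return all(a < b for a, b in zip(tags, tags[1:]))

# demo with made-up values: one odd offset and one ordering violation
print(odd_offsets([8, 0x131, 0x200]))        # -> [305], 0x131 is not word-aligned
print(tags_ascending([256, 257, 259, 258]))  # -> False, 258 comes after 259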

What are the caveats of TIFF?

As a proof of concept, the checkit_tiff repository also provides a tool called "checkit_tiff_risks". Using this tool, users can analyze the layout of any baseline TIFF file.
The most risky memory ranges are the offsets. If a bitflip occurs there, the user must search the complete 4 GB range. In practice, TIFF files are smaller, so the file size is the search space for offsets.
The most risky offsets are the indirect ones, namely the IFD0 offset and the StripOffsets tag (code 273).
Here is an example of a possible complex StripOffsets encoding:
The problem in this example is that TIFF has no redundant way to find out which bytes belong to the pixel-data stream; the StripByteCounts tag only stores the number of bytes per strip as stored in the file, and there is no end marker inside the pixel data itself.
This makes the StripOffsets tag very fragile. If a bitflip changes the offset stored in the StripOffsets tag, the whole pixel information might be lost.
Also, if a bitflip occurs in the offset area that the StripOffsets tag points to, the pixel data of the affected strip is lost.
If compression is used, the risk of losing the whole picture is even higher, because the compression methods do not use an end symbol. Instead, the buffer sizes as stored in the StripByteCounts tag are used. Therefore, a bit error in the Compression tag, the StripOffsets tag, the StripByteCounts tag or in the memory area that StripOffsets points to could destroy the picture information.

Upcoming next…


In upcoming versions of checkit_tiff, we will provide a tool to analyze the distribution of risky offsets in given TIFF files. This should put the discussion about robust file formats vs. compression on a more objective footing.

Here is a short preview:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid.tiff

It reports this kind of statistics:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01238
[02], type=                             ifd, bytes=       130, ratio=0.40248
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01238
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01238
[05], type= ifd_embedded_standardized_value, bytes=        52, ratio=0.16099
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        12, ratio=0.03715
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=         0, ratio=0.00000
[12], type=               stripoffset_value, bytes=        30, ratio=0.09288
[13], type=              standardized_value, bytes=        87, ratio=0.26935
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 323 bytes, size: 323 bytes


In this example the StripOffsets value is encoded directly in the IFD entry (there is only one strip). The problematic bytes are the offset addresses (20 of the 323 bytes are affected).

In contrast to this example, here is a special file using multiple strips:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid_multiple_stripoffsets.tiff

It reports this kind of statistics:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01250
[02], type=                             ifd, bytes=       122, ratio=0.38125
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01250
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01250
[05], type= ifd_embedded_standardized_value, bytes=        44, ratio=0.13750
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        16, ratio=0.05000
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=        40, ratio=0.12500
[12], type=               stripoffset_value, bytes=        30, ratio=0.09375
[13], type=              standardized_value, bytes=        56, ratio=0.17500
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 320 bytes, size: 320 bytes


Here you can see that we have type 11, where StripOffsets points to an array of offset addresses at which the pixel data can be found. This is similar to the StripOffsets example above. In this case we have 40 bytes with a high bitflip risk.




Montag, 12. September 2016

Die schlechten ins Kröpfchen, die guten ins Töpfchen

(english version below)
Quelle: Wikisource, public domain
Diese Woche wurde die Bitte an uns herangetragen, uns zu überlegen, wie wir mit invaliden Dateien umgehen, für die wir im Rahmen einer Dienstleistung die Langzeitarchivierung anbieten.

Die Frage kam auf, weil mit dem Dienstnehmer gerade die Übernahmevereinbarung verhandelt wird und wir für unsere eigenen Workflows strenge Validitätskriterien ansetzen.

Zuerst waren wir verunsichert. "Ja, man müsste die validen und invaliden Dateien ja im gleichen Speicher zusammenhalten", aber auch "wenn, dann müssten wir diese Dateien aber als valide oder invalide markieren".

Oder auch: "Die Submission Application könnte doch validieren und je nach Ergebnis die Daten in den einen oder anderen Workflow schieben".

Wir beide haben uns hingesetzt, nochmal unser LZA-System angeschaut und uns intensiver Gedanken dazu gemacht.

Unsere Software erlaubt es nicht, invalide Dateien in das LZA zu lassen. Man muss sich vorher für einen Workflow die Qualitätsparameter überlegen. Wenn wir eigene Validatoren benutzen, können die Qualitätsparameter (Policies) frei gewählt sein. Wir könnten jeden Mist durchlassen oder eben nur streng definierte Qualitätsperlen.
Für diesen Workflow wird all das Material ins Archiv gelassen, das diese erfüllt.

Sind die Anforderungen nicht erfüllt, landen die SIPs beim 'technical analyst', der nur die Wahl hat zwischen
  • "Reject", also Zurückweisung mit Option des erneuten Ingests 
  • "Decline" als generelle Abweisung oder 
  • direkte Reparatur innerhalb der 'technical analyst'-Workbench
Ein Verschieben von IEs durch den 'technical analyst' zwischen verschiedenen Workflows ist nicht möglich.

Eine Mischung von "validen" und "invaliden" Dateien bleibt aber auch nach längerer Überlegung nicht sinnvoll:
Die strikte Trennung der Workflows in unserem Archivsystem dient ja gerade dazu, die IEs à la Aschenputtel "die schlechten ins Kröpfchen, die guten ins Töpfchen" zu sortieren.

Damit steigt die Grundqualität in den  jeweiligen Workflows und, dies ist entscheidend, man hat im Falle der Formatmigration weniger Fehlerfälle und geringeren Aufwand.

Durch eine, wie auch immer geartete, Markierung, die in unserem System aber nicht direkt möglich ist, würden wir die Einhaltung unserer eigenen Policies gefährden.
Dies führte dann auch dazu, dass man sich von der Maxime leiten ließe, "später (wenn wir mehr Personal/Zeit/bessere Technik haben) können wir das ja vielleicht mal reparieren". Dass dieser Ansatz funktionieren soll, konnte uns bisher kein Archiv zeigen.

Eine Validierung innerhalb der Submission Application ist genauso wenig  sinnvoll. Sie soll weder jetzt noch zukünftig die Aufgaben des Archivsystems übernehmen.  Dies würde sonst dazu führen, dass man Teile des bestehenden LZA-Systems selbst nachbauen würde.

Gegenüber dem Dienstnehmer würden wir so argumentieren, dass dieser ja bei uns die Dienstleistung "Langzeitarchivierung" einkauft. Wir werden bezahlt, ihre Qualität hochzuhalten, oder in Prosa: "Der regelmäßige Tritt in den Hintern des Dienstnehmers führt zu Glücksgefühlen und ist ihm viel Geld wert."
Wer das nicht für sinnvoll erachtet, dem dürfte ein einfacher Sicherungsdienst ausreichen. Dafür gibt es genügend Anbieter am Markt.

Fazit

Manchmal braucht es ein paar Minuten nochmal über die eigene Rolle als Langzeitarchivar nachzudenken. Und es ist gut, wenn wir uns auch unter Druck diese Zeit nehmen.

Nachtrag

Eine weitere Möglichkeit wäre, in unserem Langzeitarchivsystem den Speicherbereich, in dem die Fälle des 'technical analyst' landen, stärker abzusichern (zB. durch 3-fache Kopien).

Damit würden all die IEs, die valide sind weiter in den Langzeitspeicher wandern und wären bestens langzeitarchiviert.

Und all jene IEs, die nicht vollumfänglich valide sind, landen im Speicherbereich des 'technical analyst' und würden bei Reparatur oder nach spätestens 10 Jahren dort gelöscht. Dieser Speicherbereich sichert dann dem Dienstnehmer  nur 'bitstream preservation' zu und für diesen bleibt das Risiko und der Reparaturaufwand transparent.

Der Druck die IEs sauber ins LZA-System einzuliefern kommt durch die deutlich höheren Speicherkosten für den Zwischenspeicherbereich zustande, da dieser auf auf Festplatten und nicht auf Band  basiert.

The good must be put in the dish, the bad you may eat if you wish.

Just this week we were asked to develop a strategy for treating invalid data that we provide a digital preservation service for. The question arose because we are currently negotiating the transfer agreement with our customer and we set a very strict quality policy for our own workflows. At first, we were a little uneasy. "Yes, valid and invalid data would need to be kept in the same storage.", but also "if we do this, then we'd have to flag valid and invalid files to keep them apart." Or in other words: "The Submission Application could run the validation and move the data through different workflows, depending on the validation result." So we both sat down, took a deeper look into our preservation system and contemplated this problem a little longer.

Our software does not allow invalid data into the permanent repository. The quality parameters for each workflow have to be defined in advance. If we use our own validators, the policy for the quality parameters can be chosen freely. We could either allow any crap through to our permanent repository or only meticulously chosen pearls of quality. For the workflow that will be configured, all material that complies with whatever policy we set up will be let through to the permanent storage. If the requirements aren't met, the SIPs end up with the 'technical analyst', who now has to choose between:

  • "Reject"ing the ingest with optional re-ingest
  • "Decline"ing the ingest and disallowing re-ingest
  • immediate repair inside of the 'technical analyst' workbench

The 'technical analyst' cannot move the SIP between workflows.

In conclusion, mixing "valid" and "invalid" files doesn't seem sensible, even after longer consideration. The strict workflow separation in our preservation system is there for the sole purpose of sorting the SIPs like Cinderella did with the lentils: "The good must be put in the dish, the bad you may eat if you wish." This increases the basic quality level for the corresponding workflow and, very importantly, lowers the effort and the number of error cases in the event of a format migration.

By using whatever kind of flagging (which isn't possible with our current system anyway), we would endanger the enforcement of our own policies. It would make us follow the maxim "we can fix this later, once we have more personnel/time/better technology". However, until now no archive has been able to show us proof that this approach actually works.

Running the validation inside of the Submission Application isn't sensible, either. It's not the Submission Application's job to take over any tasks from the preservation system, now or in the future. Memory institutions will generally want to avoid re-implementing parts of the preservation system.

In a discussion with our customer we would argue that they are paying us for the service of preserving their content and keeping it usable over long periods of time. We are paid to keep their quality high, or to paraphrase: "Kicking the customer's backside on a regular basis is part of the service they are paying us a lot of money for, and it is useful for both sides alike." Any institution that doesn't think this is necessary will be better off using one of the many ordinary backup services available on the market.

Conclusion

It sometimes might take a few extra minutes to contemplate one's own role as a digital preservationist. And it's good to take this time even and especially when we're under pressure.

Addendum

Another way of solving this issue could be to further secure the storage area of our preservation system that is reserved for SIPs that end up in the 'technical analyst' workbench, e.g. by keeping three copies on the storage layer. All valid IEs would keep going directly to the permanent repository and be cared for perfectly. All those IEs that are not fully valid would end up in the 'technical analyst' storage, where they would be deleted upon repair or after ten years at the latest. For this storage area, we'd only guarantee 'bitstream preservation', with the risk and the effort needed for repair operations remaining transparent to the customer. A further incentive to ingest only "clean" IEs into the preservation system is created by the considerably higher cost of this storage area, as it is based on hard disk drives instead of the cheaper tape storage.

Freitag, 19. August 2016

Image File Directories reparieren

(english version below)

Vor Kurzem hatten mein Kollege Andreas und ich eine Diskussion darüber, wie man das defekte Image File Directory (IFD) einer TIFF-Datei reparieren könnte. Er hatte dazu eine Änderung in fixit_tiff eingebaut, die ein neues und korrigiertes IFD an das Ende der TIFF-Datei schreibt und das IFD-Offset einfach auf das neue IFD zeigen lässt, so wie es auch in der libtiff vorgesehen ist. Das ursprüngliche (defekte) IFD, das üblicherweise irgendwo am Anfang der Datei steht, wird dabei nicht verändert und liegt auch in der neuen Datei wieder vor. Der einzige Unterschied ist, dass jetzt das IFD-Offset nicht mehr auf das alte IFD zeigt und es damit von TIFF-Readern nicht mehr ausgelesen wird. Die Datei wächst also mit jeder Änderung am IFD an und es bleibt immer mehr "Müll" in der Datei zurück. Andererseits hat man aber auch eine Art Historie, weil alle bisherigen IFD-Versionen erhalten bleiben.

Für mich fühlte sich diese Methode unsauber an, weil ich den Meinung bin, dass Dateien keinen unreferenzierten Datenmüll enthalten sollten; insbesondere im Kontext der Langzeitarchivierung. Nicht nur ist es eine Verschwendung von Speicher, sondern späteren Datenarchäologen könnten daraus auch Probleme erwachsen, wenn sie versuchen, die Dateien zu interpretieren und dabei unreferenzierte Datenblöcke vorfinden.

Überraschenderweise (zumindest für mich) ist das Vorgehen aber völlig konform mit der Spezifikation und hat darüber hinaus noch weitere Vorteile.
  1. Die Änderung ist schnell und billig. An die bestehende Datei müssen nur ein paar Kilobytes angefügt und der IFD-Offset im Dateiheader korrigiert werden.
  2. Wie erwähnt bleibt die "Historie" erhalten.
  3. Alle anderen Offsets, die in der Datei verwendet werden, können unverändert bestehen bleiben. Dadurch wird das Verfahren sehr robust und fehlerunanfällig.
Inzwischen ist nun die Implementierung geändert worden. Da wir meist nur TIFF-Tags ändern oder löschen ist es unwahrscheinlich, dass das IFD sich vergrößert. Daher wird nun das IFD an Ort und Stelle verändert. Nun entsteht zwar zwischen dem IFD und den Bild-/Nutzdaten ein Leerraum, in dem potentiell Datenmüll steht, aber auch das wäre laut TIFF-Spezifikation erlaubt. Außerdem ist der Leerraum bedeutend kleiner als der große Block des ursprünglichen IFDs.

Es gibt aber noch eine dritte Möglichkeit, das Problem zu lösen. Dabei würde man die TIFF-Datei komplett neu schreiben, so dass keine Lücken zurückbleiben würden. Nach meinem Dafürhalten ist das die sauberste Option. Sie hat allerdings handfeste Nachteile.
  1. Der Entwicklungsaufwand ist hoch. Um die ganze Datei zu lesen, die einzelnen Bestandteile sicher zu verwalten und zu ändern, muss einiges an Programmcode geschrieben werden.
  2. Alle Offsets müssen geändert werden. Dieser Prozess ist fehleranfällig und bewegen die Datei weiter vom Original weg. Außerdem müssen auch Offsets innerhalb von privaten Tags geändert werden. Da dort aber die innere Struktur oft unbekannt ist, kann nicht sichergestellt werden, dass alle Offsets nach dem Schreiben noch korrekt sind.
  3. Die Datei muss neu geschrieben werden. Dabei steigt die Wahrscheinlichkeit, dass Bitfehler auftreten.
  4. Die Daten können nicht mehr verarbeitet werden, ohne die komplette Datei in den Speicher zu laden.
  5. Durch den hohen Aufwand wird ein höherer Datendurchsatz benötigt, die Hardware wird stärker belastet und es entsteht mehr Rechenzeit. Bei großen Mengen an zu korrigierenden Dateien kann sich dieser Mehraufwand deutlich bemerkbar machen.
Mich würde vor allem interessieren, ob es in der Community dazu schon Meinungen oder Best Practices gibt, und wie diese lauten. Welche Variante wird bevorzugt? Ein neues IFD anhängen, das bestehende IFD an Ort und Stelle ändern, oder die Datei komplett neu schreiben?
Ich selbst bin immer noch hin und her gerissen zwischen meinem Qualitätsanspruch einerseits und den hohen Kosten dafür andererseits.



english version

My colleague Andreas and I recently had a discussion about the best way to fix the broken Image File Directory (IFD) of a TIFF file. He had implemented a change in fixit_tiff that writes a new, corrected IFD to the end of the TIFF file and adjusts the IFD offset, just the way libtiff itself does it. The original (defective) IFD, which is usually placed somewhere near the beginning of the file, is not changed at all and can be found in the new file as well. The key difference is that the IFD offset doesn't point to the old IFD anymore, making TIFF readers ignore it. Hence, the file grows with every change to the IFD and more and more "garbage" is kept in the file. On the other hand, a kind of version history is created inside the TIFF itself, because all former IFD versions are kept.
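For illustration: "correcting the IFD offset" boils down to patching four bytes at file offset 4. A minimal sketch that only shows this patch; it assumes the corrected IFD has already been appended to the file and that its (even) start address is passed on the command line:

"""Sketch: repoint the IFD0 offset in the header to an IFD appended at the
end of the file (writing the new IFD itself is not shown)."""
import struct
import sys

path = sys.argv[1]                 # TIFF that already carries the corrected IFD at its end
new_offset = int(sys.argv[2], 0)   # even start address of the appended IFD

with open(path, "r+b") as f:
    header = f.read(8)
    endian = "<" if header[:2] == b"II" else ">"
    (old_offset,) = struct.unpack(endian + "I", header[4:8])
    f.seek(4)                      # bytes 4-7 of the header hold the offset to IFD0
    f.write(struct.pack(endian + "I", new_offset))
print("IFD0 offset repointed from", hex(old_offset), "to", hex(new_offset))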

For me, this method seemed unclean, because I think that files should not contain any unreferenced garbage data; especially so in the context of digital preservation. It's not only a waste of storage, but it might also lead to problems if data archaeologists one day try to interpret the data and find unreferenced blocks.

Surprisingly (at least for me), this method completely complies with the TIFF specification and also brings some further advantages:
  1. It's a fast and cheap change. Only a few kilobytes need to be added to the file, and the IFD offset needs to be corrected.
  2. As mentioned before, the "history" stays intact.
  3. All other offsets used in the file can remain the same as before, which makes this method quite robust and sturdy.
In the meantime, the implementation has been changed. Given that we usually only change or delete TIFF tags, it is unlikely that the IFD grows in size. Hence, the IFD is now changed in place. This leaves a little free space between the IFD and the payload data, but that is allowed by the TIFF specification as well. Moreover, the gap is much smaller than the large block of the original IFD that remains with the first method.

There is a third method to solve the problem. A TIFF writer could completely rewrite the entire file, leaving no free spaces whatsoever. In my opinion, this is the most elegant option. It does, however, have some serious disadvantages.

  1. The development would be quite an effort. A lot of code needs to be written to read, manage and alter all components of the file.
  2. All offsets need to be altered. This process is error-prone and moves the files further away from their original state. Furthermore, offsets inside private tags need to be changed as well. However, as their inner structure is often unknown, no one can make sure that all offsets are still correct after writing the file.
  3. The entire file needs to be rewritten. During this process, bit flip errors might occur.
  4. The files cannot be processed without loading the whole file into RAM.
  5. Larger I/O capacities are needed, the hardware is stressed more and more CPU cycles are burnt. With larger amounts of files that need correction, this effort will surely be noticeable.
I would be most interested to hear whether the community already has opinions or best practices on this matter, and what they are. Which option do you prefer? Appending a new IFD, altering the existing IFD in place, or rewriting the whole file?
I myself am torn between my personal demand for high quality standards on the one hand and the high costs to reach them on the other hand.

Montag, 25. Juli 2016

Checking the ICC color profiles of TIFFs


Broken ICC embedding in TIFFs



In some TIFFs we noticed errors because the size information in the ICC profile did not match that of the TIFF.

For this reason, we added a check routine for the ICC header to checkit_tiff.

Here is an example of its output:


$ ./checkit_tiff -c /tmp/00000056.tif ../example_configs/cit_tiff6_baseline_SLUB.cfg
'./checkit_tiff' version: master
    revision: 85
licensed under conditions of libtiff (see http://libtiff.maptools.org/misc.html)
cfg_file=../example_configs/cit_tiff6_baseline_SLUB.cfg
tiff file=/tmp/00000056.tif
check if all IFDs are word aligned
check if only one IFD exists
check if tags are in ascending order
check if all offsets are used once only
check if all offsets are word aligned
check if tag 306 (DateTime) is correct
check if tag 34675 (ICC Profile) is correct
==> tag 34675 (ICC Profile) should have value pointing to valid ICC profile, but has value (values or count) preferred cmmtype ('APPL') should be empty or (possibly, because ICC validation is alpha code) one of following strings: 'ADBE' 'ACMS' 'appl' 'CCMS' 'UCCM' 'UCMS' 'EFI ' 'FF  ' 'EXAC' 'HCMM' 'argl' 'LgoS' 'HDM ' 'lcms' 'KCMS' 'MCML' 'WCS ' 'SIGN' 'RGMS' 'SICC' 'TCMM' '32BT' 'WTG ' 'zc00'
check if tag 256 (ImageWidth) has value in range 1 - 4294967295
check if tag 256 (ImageWidth) has valid type
check if tag 257 (ImageLength) has value in range 1 - 4294967295
check if tag 257 (ImageLength) has valid type
check if tag 258 (BitsPerSample) has these 3-values: 8, 8, 8
check if tag 258 (BitsPerSample) has valid type
check if tag 259 (Compression) has value
check if tag 259 (Compression) has valid type
check if tag 262 (Photometric) has value in range 0 - 2
check if tag 262 (Photometric) has valid type
check if tag 273 (StripOffsets) exists
check if tag 273 (StripOffsets) has valid type
check if tag 277 (SamplesPerPixel) has value
check if tag 277 (SamplesPerPixel) has valid type
check if tag 278 (RowsPerStrip) has value in range 1 - 4294967295
check if tag 278 (RowsPerStrip) has valid type
check if tag 279 (StripByteCounts) has value in range 1 - 4294967295
check if tag 279 (StripByteCounts) has valid type
check if tag 282 (XResolution) has value in range 300 - 1200
check if tag 282 (XResolution) has valid type
check if tag 283 (YResolution) has value in range 300 - 1200
check if tag 283 (YResolution) has valid type
check if tag 296 (ResolutionUnit) has value
check if tag 296 (ResolutionUnit) has valid type
check if tag 254 (SubFileType) has value
check if tag 254 (SubFileType) has valid type
check if tag 266 (FillOrder) has value
check if tag 266 (FillOrder) has valid type
check if tag 271 (Make) has  value matching regex '^[[:print:]]*$'
check if tag 272 (Model) has  value matching regex '^[[:print:]]*$'
check if tag 274 (Orientation) has value
check if tag 274 (Orientation) has valid type
check if tag 284 (PlanarConfig) has value
check if tag 284 (PlanarConfig) has valid type
check if tag 305 (Software) has  value matching regex '^[[:print:]]*$'
check if tag 306 (DateTime) has  value matching regex '^[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] [012][0-9]:[0-5][0-9]:[0-6][0-9]$'
check if tag 34675 (ICC Profile) exists
check if tag 34675 (ICC Profile) has valid type
check if forbidden tags are still existing
found 1 errors

Extraction and further analysis of the ICC profile


For a more in-depth analysis, you can proceed as follows (a small scripted check follows the list):
  • Extract the ICC profile with the tool "exiftool":
    exiftool -icc_profile -b -w icc /tmp/kaputt.tiff
  • Load and validate the extracted ICC profile "/tmp/kaputt.icc" with the ICC profiler "profiledump":
    Windows: wxProfileDump.exe
    Linux: wine wxProfileDump.exe
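Independently of wxProfileDump, the preferred CMM type can also be checked with a few lines of script against the extracted profile. A minimal sketch, assuming the profile was written to /tmp/kaputt.icc as above and using the signature list from the checkit_tiff message:

"""Sketch: check the preferred CMM type in the header of an extracted ICC profile."""
import struct

# signatures accepted by checkit_tiff, taken from the error message above
KNOWN_CMM_TYPES = {
    b"ADBE", b"ACMS", b"appl", b"CCMS", b"UCCM", b"UCMS", b"EFI ", b"FF  ",
    b"EXAC", b"HCMM", b"argl", b"LgoS", b"HDM ", b"lcms", b"KCMS", b"MCML",
    b"WCS ", b"SIGN", b"RGMS", b"SICC", b"TCMM", b"32BT", b"WTG ", b"zc00",
}

with open("/tmp/kaputt.icc", "rb") as f:          # profile extracted with exiftool as above
    header = f.read(8)

size = struct.unpack(">I", header[:4])[0]         # ICC headers are big endian
cmm_type = header[4:8]                            # bytes 4-7: preferred CMM type signature
print("profile size:", size, "bytes, preferred CMM type:", cmm_type)
if cmm_type not in KNOWN_CMM_TYPES and cmm_type != b"\x00\x00\x00\x00":
    print("preferred CMM type is not a registered signature; checkit_tiff will complain")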


Montag, 11. Juli 2016

Warum "AIPUpdate" notwendig ist

In recent discussions with archivists I have noticed again and again that they cannot relate to the term "AIPUpdate" and therefore do not understand why, from a library's point of view, the topic should be included in the revision of the OAIS reference model.

Classical archives


A classical archive works according to the principle of provenance, i.e. records are organized by their origin or creation. Usually this organization takes the form of files compiled by the agency that produced the records. By the time these records are handed over to the archive, they are closed documents.

From the point of view of classical archives, no further changes to the archived material are to be expected. This conviction has carried over into the world of digital preservation archives as well.

Libraries and museums


The situation in libraries and museums is different. As a rule, they work according to the principle of pertinence, i.e. organization by subject groups. Because of the differing depth of indexing that comes with this, documents may be catalogued in more or less detail at different points in time. The indexing is also never perfect, because for many sources certain information can only be established gradually by historical research.

In addition, the sheer volume of digitized material in the VD18 project alone leads to digitization errors that are not always noticed immediately.

Furthermore, libraries and especially museums already have to preserve digital objects (e.g. electronic installations) during their active lifetime.

All of these points mean that, in contrast to archives, digital preservation in this field has to deal with documents or parts of documents that are partially incomplete and still changing.

With an AIPUpdate it is possible to retroactively add a forgotten page, correct an error or amend metadata for an object that has already been secured in the preservation archive.

AIPUpdate principles


For AIPUpdate to work, preservation systems have to follow these principles (a small illustrative sketch follows the list):
  1. Clean management of a persistent identifier, so that an update can be correctly matched to the object already held in the archive
  2. Version management for the AIPs in the preservation archive
  3. No deletion of "old" AIPs (so that all versions remain traceable)
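Purely as an illustration of these three principles (and not of how any particular repository implements them), an AIP update can be thought of as appending to a version list that hangs off a persistent identifier:

"""Illustrative sketch of the three principles: a persistent identifier maps to an
append-only list of AIP versions, and old versions are never deleted."""
from dataclasses import dataclass, field

@dataclass
class AIPVersion:
    number: int
    ingest_date: str
    reason: str                      # e.g. "forgotten page added", "metadata corrected"

@dataclass
class ArchivedObject:
    persistent_id: str               # principle 1: stable identifier for matching updates
    versions: list = field(default_factory=list)   # principle 2: full version history

    def aip_update(self, ingest_date, reason):
        """Principle 3: an update only appends a new version, it never overwrites."""
        version = AIPVersion(len(self.versions) + 1, ingest_date, reason)
        self.versions.append(version)
        return version

obj = ArchivedObject("urn:nbn:de:example-123")     # made-up persistent identifier
obj.aip_update("2016-07-11", "initial ingest")
obj.aip_update("2016-09-01", "forgotten page added")
print([(v.number, v.reason) for v in obj.versions])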
Since an AIPUpdate always puts additional load on the preservation system (e.g. read/write operations on tape storage, but also processing), such operations should be bundled wherever possible.