Friday, September 22, 2017

Calling Image Magick from PowerShell

I am old enough to know what a BAT file is. I used to know MS-DOS 3.0 inside and out and created all kinds of tricky BAT scripts for performing lots of useless (and a few useful) tasks. Anyone else who has ever done this knows that you were likely at some point to be tricked by the command processor's funky syntax and semantics. You would get stuck for hours trying to figure out why something that should be obvious was not working as you expected.

Enter Windows NT and CMD. Things were much improved. There was a lot more power and the chance to create even trickier CMD scripts. But, the problem of being out-tricked by the tool was ever present. It seemed I was bound to pull some amount of hair for any non-trivial script I would create (as evidenced by minimal hair remaining on my head).

Fast forward to present day and Microsoft has deprecated all those old scripting technologies in favor of PowerShell. I love it! The days of tricky code are gone. Modern language features, predictable syntax and semantics, powerful tools to perform typical functions, it's all there... until I tried to call an EXE.

The days of me spending hundreds of hours to become an expert in any of this technology are long gone, but it still has its uses and total expertise is not needed (not to mention that Mr. Google is a close friend of mine). I have written scripts to do things I would never have imagined doing in a BAT or CMD. I can call remote web services with minimal effort. Two lines of PowerShell code can do what would have taken 200 in a BAT. Then I needed to call Image Magick.

Image Magick is a powerful and free tool for processing images. It does hundreds of things I cannot imagine ever needing and a number I can see using all the time. This time I just needed to combine single pages of scanned TIFF files into a multi-page TIFF file. Magick can do this with a trivial command line statement.
magick file1.tif file2.tif filecombined.tif
Actually I needed a few options to really do what I wanted, but it was still pretty straightforward.
magick -quiet file1.tif file2.tif -compress JPEG filecombined.tif
I thought it would be simple to add to my PowerShell script just like calling a CmdLet. Not so much.

What I really wanted to do in PowerShell was something like the following.
magick -quiet $ListOfFiles -compress JPEG $Destination
Where my list of input files and my destination file were stored in variables. Well, it was obvious early on that Magick was having a hard time digesting the parameters I was passing. A little playing around with quotes and things seemed to get it sorted out, but I never really understood why it worked or why it did not work, as the case may be. However, I was ready to put my script to its one-time use.

I rather like PowerShell's Start-Transcript cmdlet. It really simplifies logging for one-time scripts like the ones I often need to create. What I did not realize was that I had accidentally developed my script on PowerShell v4, which does not support this handy tool. So, I upgraded to v5 only to find that everything else fell apart.
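For anyone who has not used it, the pattern is only a couple of lines; the log path here is just an example.
Start-Transcript -Path 'C:\Logs\CombineTiffs.log'
# ... the rest of the script runs here with its console output captured to the log ...
Stop-Transcript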

It turns out that not really knowing why my script was calling Magick correctly or incorrectly meant that a subtle change in EXE processing between v4 and v5 PowerShell was enough to break everything. Well, I went back to my brute force method of trying all sorts of variations only to be stymied at every attempt.

Off to ask Mr. Google about my problem. Turns out that calling an EXE in PowerShell is something special and sometimes it works and sometimes, not so much. In fact, there is a long list of methods for making such calls published by Microsoft. TechNet has an article on the complex topic, as do many other blogs and StackExchange questions.

PowerShell: Running Executables

Unfortunately, I could not find one article that would directly solve my issue. The Magick method of using the @ sign to get a list of input files from a text file looked promising, but only resulted in really strange Magick errors like "magick.exe: unable to open image '@z:ÿþz'". What?
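Looking back at it, the ÿþ in that error looks like a UTF-16 byte order mark, which is what Out-File writes by default, and Magick expects the @ file list to be plain text. I never went back to verify it, but a hedged guess at a fix would be to force ASCII encoding when writing the list; the file names here are only examples.
# Write the list of input files without a Unicode BOM so magick can read it
$ListOfFiles | Set-Content -Path 'files.txt' -Encoding Ascii
magick -quiet '@files.txt' -compress JPEG $Destination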

What finally did bear fruit was using an array to pass the parameters to the executable along with the call (&) operator. Once I got this set up, it worked perfectly (and logically) every time. Now my code looked a little more like the following.
$MagickParameters = @( '-quiet' )
$MagickParameters += 'file1.tif'
$MagickParameters += 'file2.tif'
$MagickParameters += @( '-compress', 'JPEG' )
$MagickParameters += 'filecombined.tif'
&'magick' $MagickParameters
Simple. I realize my method of recreating the array to add new parameters is not super efficient, but performance was not a factor here and a more efficient method would have been a waste for a one-time script. Obviously, there was other code intermixed for actually getting the filenames, but I have saved you from the unnecessary complexity for this example.
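If performance ever did matter, a generic list would avoid rebuilding the array on every addition. A quick sketch, assuming PowerShell 5 or later for the ::new() syntax:
# Build the parameter list without recreating an array on each addition
$MagickParameters = [System.Collections.Generic.List[string]]::new()
$MagickParameters.Add( '-quiet' )
$MagickParameters.Add( 'file1.tif' )
$MagickParameters.Add( 'file2.tif' )
$MagickParameters.AddRange( [string[]]( '-compress', 'JPEG' ) )
$MagickParameters.Add( 'filecombined.tif' )
&'magick' $MagickParameters.ToArray()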

Unlike my tricky BAT and CMD scripts from the past, the PowerShell solution is pretty clean once you know the trick.

Saturday, January 10, 2015

Accessing the encrypted file path in ImageNow

Intellectual property protection

Many years ago I took on a project to determine the protocol used to communicate between a Norand hand-held computer and a Norand 4820 printer. Norand called this protocol the Norand printer control protocol (NPCP). This was a turnkey proprietary solution with Norand controlling hardware and software. It was designed for the connection between a portable computer and printer for route accounting and sold primarily to bakery and beverage delivery companies.

As the Norand solution became older and the cost of functionally similar printers decreased well below the Norand product cost, a demand opened up for third party printers that could be used in place of the Norand printers. The only problem was that the portable computers would only print using NPCP, a protocol that was unpublished.

At the time the world was starting to see the first free and open source software. Proprietary computing systems were giving way to DOS, the new Windows, Unix, XMODEM, and TCP/IP. The proprietary Norand protocol seemed like a dinosaur. After a bunch of digging through serial traces and obscure NEC uPD7810 microprocessor code I was able to create a Norand printer emulator. What became clear in this process was that Norand designers had intentionally obfuscated the NPCP protocol and the code that implemented it.

In many ways this is no different than HDMI encryption or software anti-piracy mechanisms. The intellectual property owners want to protect their property.



Content as a hostage

Wind forward to today and I have a customer that would like to extract all their documents from a document management system called ImageNow from Perceptive Software. They have added millions of pages to their ImageNow system over the years, but the annual cost for maintaining the software is prohibitively high and much lower cost and equally capable products are now available in the marketplace.

The documents, the “content”, in ImageNow are unquestionably owned by my customer. The software is installed on a Windows server, the data is stored in an SQL Server database, and the pages are stored in files in the Windows file system. A reasonable expectation for them would be to terminate their contract with Perceptive Software and be able to access their content without use of ImageNow or further cost from Perceptive. Unfortunately it seems this is not possible as Perceptive has encrypted one critical piece of information that describes the relationship between the documents stored in ImageNow and the pages stored in the Windows file system: the image file path.

I cannot think of any reason that this encrypted file path adds value for any customer of Perceptive. It appears to be in place to force end users to engage Perceptive for additional software and services to extract their documents. The content that is owned by my customer, not Perceptive, is effectively being held hostage by this encrypted file path.

Perceptive does sell an option called ImageNow Output Agent that can export documents in bulk and may help here, but customers in this situation probably do not own this option.

A content lock is a fairly dangerous situation for users of document management systems. End users are left with a difficult or impossible situation if the manufacturer of the software goes out of business, gets acquired, or otherwise changes their plans to continue to support the software. My customer is in a little better position with Perceptive, but one that will require them to spend new money at a time they are trying to downsize their document management system expenditure.



Unlocking the encrypted file path in ImageNow

The good news for my customer and others in this situation is that there is a way to get their file path without engaging Perceptive or purchasing any other modules of ImageNow. They will, however, need the skills and time of a JavaScript programmer.

ImageNow includes a scripting language called iScript. This is basically JavaScript with some libraries specific to ImageNow. This scripting environment can access many or all of the objects stored in ImageNow.

A search of the ImageNow 6.x object documentation will show that INLogicalObject provides information about the actual files stored in the file system. However, the documentation does not mention any file path information. A little closer inspection under the hood reveals that the object does have a file path member and that its value is not encrypted. The following very simple example shows finding a single document and displaying its file type and unencrypted file path on the console.

// get a single document
var results = INDocManager.getDocumentsBySqlQuery( "", 1, var more );
if ( results )
{
    var doc = results[0];
    doc.getInfo();
    // get a single page for the document
    var logob = INLogicalObject( doc.id, -1, 1 );
    logob.retrieveObject();
    printf( "file type: %s\n", logob.filetype ); // this member is in the documentation
    printf( "unencrypted file path: %s\n", logob.filepath ); // this member is not in the documentation
}
If you would like to tackle iScript, the documentation appears to be available with an ImageNow installation. There is a small amount of iScript information on the web including a nice introduction from Blaine Linehan of Wichita State University. He has a blog on programming with iScript for beginners.


Friday, February 8, 2013

Bose Quiet Comfort 3 are the Greatest Headphones Ever!

So often you read product reviews from people that just acquired the product. They have only had a limited opportunity to put the product through its paces and are really only able to give a first impression. At best you might read a review after a few months of use.

I have been using the Bose Quiet Comfort 3 headphones since 2008, nearly five years. It all started when, for the first time in my career, I did not have my own office with a door. The workplace had gone open-concept and the associated noise was a new experience. I wanted something to dim the drone and I had heard that noise cancelling headphones might do the trick.

Someone else in the office had received a set of Bose QC3s as a gift. I gave them a try and they were great; then I saw the price. That got me looking around. Surely I could do better. I started researching headphones and found a lot of reported downsides to the Bose units: they were too fragile, they used a proprietary battery, and they were expensive. I did find some Sennheiser headphones that sounded pretty good and they were half the price of the Bose, so I bought them; big mistake. They just did not cut it. They were not very comfortable and they had nowhere near the noise cancellation of the Bose headphones I had had a chance to try.

I decided to take the $400 (plus tax) plunge when Bose had a $50 accessory deal. You would get your choice of accessory to a value of $50; I went for the spare battery after hearing horror stories of dead batteries on long plane flights.

From day one I was not disappointed. They worked just like the first ones I tried and they were comfortable enough for me to use all day long. I was expecting to be swapping out batteries every day, but to my surprise the battery lasts forever. I listen to Radio Paradise when I work and I can play that all day long, all week long, on a single charge. I can make round trip flights of 10 hours each way and not worry about running out of power, though I have the spare battery charged and ready to go if I need it. The airplane adapter is great for cleaning up the signal on the old school planes that were built before the iPod generation.

My very first flight with the headphones was almost a disaster. They were new to my travel kit and I left them on the plane after an Air Canada flight to San Francisco. I figured I had just thrown $400 down the drain, but I left my name with Air Canada operations and they got them back to me via Montreal a few days after I returned. They come with a nice little case including a spot for a business card, so it is pretty easy to keep all the parts together and identifiable.

After about three years I noticed that the ear pads had started to wear and needed to be replaced. For about $40 including shipping, Bose sent me a new set of pads that were easy to fit in place of the old ones. After nearly five years one of the two cords that came with the headphones started to separate between the wire and the connector sheath. Unfortunately I could not find a replacement at bose.ca, but I found something compatible on eBay from China. For $7, including shipping, I got a new cable. It is not quite the quality of the original, but it is working just fine.

I loved my Bose QC3s from the first time I tried them and I still love them five years later. They have only needed minor maintenance and a little care. They are always put away in their custom fit case after use. They can still be purchased today. They are not any cheaper today than they were five years ago, but I must say they are worth every penny. I see they now include an iPhone/iPad adapter. My total cost over five years has been just about $500. At 200 work days per year, averaging about 5 hours per day of use, I figure I have paid about 10 cents an hour for a comfortable and quiet working environment.

While I cannot claim to have tested a wide variety of headphones, I sure think my Bose QC3s are the greatest headphones ever!

Thursday, January 24, 2013

Breaking the Folder Habit : Document Management Without Shackles

An Electronic Document Filing Disaster

Did you ever create folders on your computer to help you organize your documents? Did it seem like a good idea at first, but later you realized your folders did not fully categorize your ideas, so you added some more folders inside your existing folders? Did you notice that you repeated this process a few times and after a while you had so many folders that you stopped putting documents in the correct folder when you were busy, because it was too much trouble?
[Screenshot: an example of a deeply nested project folder hierarchy]
If any of this sounds like you, do not worry; you are not alone. Many individuals and companies have gone through this cycle, often many times. For some, the end result is to purchase a document management system. These systems offer the promise of better organization for documents in addition to security, audit controls, and much more.

With a shiny new document management system installed, many library administrators proceed to setup their ideal library structure. They have had the benefit of creating and recreating the structure many times on their old shared folder based systems, so they are experts. Of course, their structure is an elaborate and accurate taxonomy of their document management requirements.

Soon they discover users are misfiling documents or not putting them in the document management system at all. Others are complaining about how long it takes to file a document. Upon investigation it turns out that some users are unsure how the complex taxonomy applies to their document, or their document should be in two or more places at once. In any case, it is enough trouble that they are more likely to file it on their local drive “temporarily” and get around to putting it in the document management system when they have more time.

In the same environment, when someone wants to find a document they are faced with digging through the structure to get what they want. The only other option they have is to perform an unstructured search and hope the document contents will help to reveal its location. A user wanting to find all timesheets from last week in my example screen shot above would need to drill into each project down to the timesheet folder and then look to the document name or modified date to hopefully find the timesheets with last week’s entries.

Of course, the obvious solution to finding the timesheets is to add a ninth level of hierarchy for each week and ask users to put their timesheet into the correct folder. I think you get the point.

How did we get to think this way? Do not blame yourself. Organizing files into a hierarchy started with Multics and Unix back in the early 1970s and dramatically broadened with MS-DOS 2.0 in the early 1980s and nearly every operating system since. We all have many, many years of conditioning.

Easy Document Filing

You can avoid the look of befuddlement with a small change to your thinking. First, extract the metadata from your folder hierarchy. It is not as tough as it might seem at first and there is a bit of guidance here. You can organize this metadata according to the types of documents you are storing; some will need more and some less. You may decide that some metadata should be mandatory and some optional.

Your document management system will almost certainly give you a place to configure this metadata; after all, that is what they are supposed to be good at. Some document management systems will directly use the term metadata. Others will substitute indexes, properties, tags, labels, and possibly others to mean the same thing. The point is that this is information that describes the document without the need to modify the contents of the document.

Folders are very good for setting access control for a variety of documents. Create as many as you need, but no more. You do not want users to need to think too hard about putting documents away. By restricting permissions for users to only areas related to their job, they may only need to look through a very small list of folders; perhaps only one!

With a metadata based approach, when a user stores a document they will be asked to identify the document type. Some document management systems may call this the document schema or document class. They will also be asked to add values for the metadata that has been specified for the document type. Rather than navigating a tree of folders to find the right place to store the document they will see one piece of information at a time and the specific location where the document is stored will be less important. Where a set of metadata values can be predetermined the user may be able to select from a list to help ensure accuracy. Depending on the source of the document and the capability of the document management system it may be possible to automatically extract metadata directly from the document.

I often hear library administrators complain that they cannot rely on users to fill out metadata. Unfortunately these are the very same users who will not navigate the folder hierarchy to store documents in the correct place either. The advantage with the metadata approach is the question and answer style of collecting the metadata. In the worst case, if users are in too much of a hurry to enter all the metadata, at least they would be able to easily put the document in the document management system rather than on their local drive.

An added benefit of adding metadata to a document is that the information lives with the document regardless of where the document is located. When a folder structure is used to represent meta information about a document the relationship is indirect. Moving a document from one part of the library to another will cause the original meta information to be lost and new meta information to be added regardless of whether or not that is what was intended.

Locating Documents Easily

Library administrators will often cite the ease of locating documents as the reason for creating the deeply structured folder hierarchy. In reality, that is far from the truth. When was the last time you browsed through a tree of options to find something on the web? You almost certainly have a few personal bookmarks, but you probably find most things using a search engine like Google. This sort of searching capability is the hallmark of many document management systems.

Most document management systems have a method for individuals to bookmark documents that are important to them. Other names for a bookmarking-like feature include shortcuts, favorites, and virtual folders. In this case it is not up to the library administrator to predetermine what you should care about and how you should find it. The individual user gets a chance to customize their environment according to how they work.

When it comes to searching, I know many of you are thinking that Google does not always find everything you are looking for. You do not want to lose your critical documents when you cannot think of just the right search term. It is true that document management systems can often use the text contents of documents to perform a full text search, but even better results are possible when the search is looking for specific metadata. In fact, you can expect that with metadata searching your chances of finding a document will never be worse than with hierarchical folders and will generally be much better.

I use Gmail extensively and it has a very good search capability, but I can enhance it. Gmail allows me to add one or more labels (metadata) to my email message and it allows me to qualify my search by specifying the labels in addition to my text. This already provides a huge improvement on helping me find my documents and it is an enormous leap over sorting my emails into individual folders in Microsoft Outlook.

Back in the document world, users have typical ways they need to find their documents on a day-to-day basis. These need to fit with the operational processes that they support in their jobs. They tend to remain fairly constant. In my previous example an accounting clerk was looking for last week’s timesheets for all projects. They do this search every Monday to update their project accounting, so they do not want to worry about forming a special search each week. Document management systems provide a solution here as well.

Many different types of software systems containing files have a method of saving a search to use over and over again. It is a way of looking at your documents as though they were in a folder, but the folder is virtual and dynamic rather than hard wired into a hierarchy. One of the first examples of this was the saved search in BeOS. Apple Mac OS X borrowed this as “Smart Folders” and Microsoft Outlook now has “Search Folders”. Adobe Lightroom has “Smart Collections”, M-Files has “Dynamic Views” and FileHold has “Saved Searches” to name a few.

A common complaint about using metadata instead of a folder hierarchy is that simple metadata fields cannot convey a complete taxonomy that is best reflected in a tree. Document management systems can often solve this problem by simply providing a tree structured metadata list such as the drill down fields in FileHold.
[Screenshot: tree structured drill down metadata fields in FileHold]

In Conclusion

  • A document management system will help you store and organize your documents.
  • Computer users have years and years of hierarchical folder training in their heads that will get in the way of implementing a better approach to storing documents. Help them get over it.
  • Use folders to control access, not to represent meta information.
  • Find a document management system that gives you the capability to create folders that are virtual and dynamic based on search criteria against metadata.

Friday, October 19, 2012

Exporting Metadata from Canon ImageWare Document Manager

Note: Canon ImageWare Document Manager (iWDM) 4.1 Workgroup Edition was used for this analysis, but I expect the details are the same or very similar for other editions and minor versions.

[Updated 2015-04-22]

ImageWare is like many document management systems sold by scanner companies. They tend to be focused on finding a place to store the mountains of scans the scanner companies' devices can produce. They can be difficult to use and have a number of limitations compared with general document management solutions. As a result many users of these systems would like to switch to more powerful and comprehensive products like Sharepoint, FileHold, or LaserFiche.

The problem with making the switch is getting the thousands of documents from the old system into the new one. In the case of ImageWare there are some complications. All the documents are stored in a proprietary volume block file format with a dot IMG extension. While Canon does provide a method for exporting these in their original source format, it requires a fair bit of manual effort for anything but the simplest repository and it does not export the documents' metadata.

The metadata is often the most important part of these documents. It is typically created using optical character recognition when the document is scanned or it is manually entered when the document is filed. Either way it is a valuable commodity that should not be lost. The good news is that the metadata is stored in a Microsoft SQL Server database. With the right technical skills the metadata can be extracted and prepared to import into a new document management system.

iWDM stores all its files in a cabinet. Each cabinet has one or more folders. The folders can be nested. Underneath the covers, the cabinets are stored in the file system in a folder called iW DM Cabinet. Each cabinet is stored in a sub-folder named Cabinetx where the x is a number that is incremented by the system. If your document repository is in the default location on your C drive and you have a single cabinet called Accounting you would find it in the following location.

C:\iW DM Cabinet\Cabinet1

You can find this information in the cabinet properties in the iWDM user interface, where the actual location and names are provided.

The Cabinet1 folder would have one or more sub-folders containing your files in the proprietary IMG format. At the root of the folder you will find the all-important Microsoft SQL Server database files. There should be four files with the extensions MDF or LDF. The MDF files are the database files and the LDF files are the transaction log files. The file named iWDM_Accounting_Data.mdf is your accounting cabinet database. RM_Accounting.mdf is the database used for maintaining the full text search indexes.

After all that preamble we can get to the metadata, which is stored in the cabinet database. A quick view of the iWDM_Accounting_Data.mdf file with a tool like SysTools MDF Viewer will show a number of tables. The key table is Document. There is one row in this table for each document in the system. There are three key fields that will associate the database rows with the files that were exported. The first one is FolderIndex, the second is Name, and the third is Creator. The Creator is effectively the file extension such as .tif or .jpg. However, there is a special internal Canon image format with a creator value of .image. When these files are exported they will be converted into the image format you select.

Documents are stored in iWDM in a hierarchy that looks like a Windows folder structure with the cabinet at the root level. When you export a folder that structure is maintained when the documents are put into the Windows file system. The FolderIndex is the key to finding the folder the document will be in. It is a link into the Folder table. The Folder table includes the name of the folder and the tree structure. Folders have a FolderType column that can contain one of five values.

  • 0 - Cabinet
  • 2 - Main trash folder
  • 4 - Hidden user trash folder
  • 5 - Normal folder
  • 9 - Deleted folder

Note that trash folders are a special case of folders. They are created automatically as needed by the system. When a document is deleted it gets moved to a corresponding trash folder. It still exists in the Document table, but it is not possible to export it any longer. When the document is deleted from the trash folder it is removed from the Document table. The trash folder you see in the user interface contains a hidden trash folder for each user. When a document or folder moves to the trash folder the Location column gets changed from 0 to 2.

All that remains to match the document in ImageWare to the file that was exported is the filename. The filename in iWDM is stored in two columns. The base name is in the Name field and the file extension is in the Creator field.
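If you would rather pull that relationship out with a script than by hand, a minimal PowerShell sketch along the following lines should do it. The connection string is an assumption you will need to adjust for wherever the cabinet database is attached; only the columns described above are used.
# Example connection string only; point it at the attached cabinet database
$conn = New-Object System.Data.SqlClient.SqlConnection 'Server=.;Database=iWDM_Accounting_Data;Integrated Security=True'
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = 'SELECT FolderIndex, Name, Creator FROM Document'
$reader = $cmd.ExecuteReader()
while ( $reader.Read() )
{
    [PSCustomObject]@{
        FolderIndex = $reader['FolderIndex']
        # the exported file name is the base name plus the Creator extension
        FileName    = '{0}{1}' -f $reader['Name'], $reader['Creator']
    }
}
$reader.Close()
$conn.Close()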

Now that we have the basic relationship between the exported file and the database we can start to find the associated metadata. There are three sources of metadata in iWDM: document properties, system index, and user index. The first two can be found in the Document table. Document properties like author or create date are available as columns of this table. System index refers to three predefined category fields. An index to the category values for each document is stored as Category1, Category2, and Category3 fields with each document. The index provides a reference to the values in the Category table.

The user index is slightly more complicated. The Document table provides no direct link to the user index. This work is done by the DocUserIndex table that provides a multi-way link to the Document table, the UserIndex table, and then to one of several user index data tables. There are eight different types of user indexes and six corresponding value tables.

User Index Data Type      Value Table            User Index Type
fixed string              FixedStringIndexValue  0
fixed maximum string      FixedStringIndexValue  1
variable string           StringIndexValue       2
date                      DateIndexValue         3
integer                   IntIndexValue          4
unsigned integer          IntIndexValue          5
floating point decimal    FloatIndexValue        6
boolean                   BoolIndexValue         7

There is a special case when a user index has been defined as a "selectable list". In this instance the administrator predefines the possible values when the system is setup. The user can only choose from the list when they set the value of the user index. These fields have the UserIndexValueType set to 1; all other types are set to 0. The corresponding value table contains each possible value in the selectable list regardless of whether or not any documents have been assigned the value. Boolean indexes are a special case as they are always a selectable list with the values TRUE and FALSE predefined.

As an example, the following SQL query will return all string user index values for the given document:


SELECT d.Name 'Doc Name', f.Name 'Folder Name', 
   u.Name 'Index Name', u.UserIndexType 'Type', 
   fs.Value 'Fstr (0-1)', s.Value 'Str (2)'
FROM  DocUserIndex di 
   LEFT JOIN Document d on di.DocumentIndex = d.DocumentIndex 
      AND di.FolderIndex = d.FolderIndex
   LEFT JOIN Folder f on f.FolderIndex = d.FolderIndex
   LEFT JOIN UserIndex u on u.UserIndexId = di.UserIndexId
   LEFT JOIN FixedStringIndexValue fs on di.ValueId = fs.ValueId 
      AND di.UserIndexId = fs.UserIndexId
   LEFT JOIN StringIndexValue s on di.ValueId = s.ValueId 
      AND di.UserIndexId = s.UserIndexId
   WHERE d.Name = 'document name' AND f.Name = 'folder name'


That is all that is needed to extract the metadata for each of your documents. Each comprehensive document management system has its own method of importing documents and metadata. For example, in FileHold you would create a document.xml file with the metadata and the document locations at the root document folder and import it using managed imports.

It is unfortunate that the documents are stored in a proprietary format. This makes it difficult to automate the entire process. It appears as if the only change to the file is that iWDM adds 54 bytes to the front of the file. It may just be a simple matter of stripping this data off, but that is an investigation for another day. If you do export the documents using the iWDM export function you will likely need to reformat or compress the output files using a tool like Batch TIFF Resizer or TIFF Junction.

[Update]

I have taken a little closer look at the proprietary IMG files. As I suspected, removing the header (55 characters) of the IMG will reveal the embedded file. At least this is true for one type of IMG file; it turns out there are two: modifiable and non-modifiable. The modifiable version contains a single file and stripping the first 55 bytes off will give the original. The non-modifiable format can contain multiple files and the 55 byte rule only works if there is exactly one file in the IMG.
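A hedged sketch of that stripping step in PowerShell, assuming a modifiable, single-file IMG and the 55 byte header described here; the paths are only examples.
# Drop the 55 byte header from a modifiable, single-file IMG to recover the embedded file
$imgBytes = [System.IO.File]::ReadAllBytes( 'C:\iW DM Cabinet\Cabinet1\sample.img' )
$payload  = New-Object byte[] ( $imgBytes.Length - 55 )
[System.Array]::Copy( $imgBytes, 55, $payload, 0, $payload.Length )
[System.IO.File]::WriteAllBytes( 'C:\temp\sample.tif', $payload )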

I did dig into the format a little deeper and there are a few interesting titbits.

The first 32 bytes of the file appear to be the volume name: a 31 character, null terminated ASCII string. This is followed by 5 bytes whose purpose is not known to me. Next we have what appear to be four 32 bit little endian integers: a pointer to the end of the file, the IMG file number, the length of the first embedded file, and a pointer to the IMG file number (always seems to be 41). Finally we have "VU" (hex 5655).

In the case of the modifiable IMG files the image file number is translated to base 36 and used for the file name. For the non-modifiable format the image file number relates to the file number in the IMG file. The file numbers start at 0.

For the non-modifiable format there is a secondary header at the start of the second and subsequent files. It starts with three 32 bit integers. The first one is the IMG file number, the second is the length of the embedded file, and the third is a pointer to the IMG file number. As before, the header ends with "VU". I have not investigated, but I suspect there is a flag in the header somewhere for deleted files.
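If you want to poke at those header fields yourself, this little PowerShell sketch just prints them. The offsets come only from the observations above, so treat it as a guess rather than a specification; the path is an example.
# Dump the observed IMG header fields: 32 byte volume name, 5 unknown bytes,
# four 32 bit little endian integers, then the "VU" marker
$bytes = [System.IO.File]::ReadAllBytes( 'C:\iW DM Cabinet\Cabinet1\sample.img' )
$volumeName = [System.Text.Encoding]::ASCII.GetString( $bytes, 0, 32 ).TrimEnd( [char]0 )
$endPointer = [System.BitConverter]::ToInt32( $bytes, 37 )
$fileNumber = [System.BitConverter]::ToInt32( $bytes, 41 )
$fileLength = [System.BitConverter]::ToInt32( $bytes, 45 )
$numberPtr  = [System.BitConverter]::ToInt32( $bytes, 49 )
$marker     = [System.Text.Encoding]::ASCII.GetString( $bytes, 53, 2 ) # should be "VU"
"volume '{0}', end {1}, file #{2}, length {3}, number pointer {4}, marker {5}" -f $volumeName, $endPointer, $fileNumber, $fileLength, $numberPtr, $marker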

There is an option to encrypt the IMG file. If encryption is in place, all bets are off for recovering the embedded files.

For the embedded files the format seems unchanged with the exception of images. If the images are converted to the iWDM image format they still seem to be stored in their original format. You can check the file signatures at www.filesignatures.net or similar sites to confirm this even though you have lost the original image type in the database. Interestingly, binders are stored as self-extracting executables. When you run the binder code it extracts a PDF file with each of the images in the binder on its own page.

Tuesday, September 4, 2012

Capturing Metadata from Folder Hierarchies

Document management solutions have come a long way from a time when dropping critical files into a network folder was state-of-the-art. Common solutions like Sharepoint, FileHold, LaserFiche, and others have been around for many years. Even with the availability of these systems, there are still large numbers of companies that store their documents in a hierarchy of folders.

Users of GUI operating systems like Windows have been trained for a few decades about how to use folders to store information. A big challenge with making the move to a real document management system is to get away from folder mind block. This is the condition where a user wants to have a folder for everything and put everything in its folder.

Using folders to organize documents creates a number of challenges, including the following:

  • The connection between the folder and document is tenuous. If the document gets moved or copied the information that the folder provided about the document is lost. 
  • The visual hierarchical nature of folders provides an impediment to storing the documents. The user must find the right spot to drop the file. Folders tend to look the same; choosing the wrong one is easy. Or, one slip of the mouse and the document is dropped in the folder next to the intended folder.

Document management systems (DMS) tend to use methods other than folders for storing and retrieving documents. Metadata is the most universal of these. Other names for metadata include tags, labels, and properties.

Metadata is information that describes a document. For instance, a document could be a project plan. If the project plan document had a metadata field describing the document type and the value was project plan, it would be very easy to find this document in a search for project plans. Additional metadata could include fields that describe the project name, client name, or whether or not the document was a final or a draft version.

Some metadata is explicit and some is implicit. Explicit metadata is expressly defined for the document and implicit metadata is derived. The previous metadata fields could all be considered explicit. Implicit metadata could include the file type (Word document, JPEG image, etc.), the number of words in the document, or the last user to modify the document.

When documents are moved from a folder hierarchy to a DMS, metadata can be implied from the structure of folders. It is possible to preserve this implied metadata when documents are moved to a DMS. This enables users to find documents using the same information as before while using a more efficient document repository. The following example demonstrates this. The Excel document plan.xlsx in the legacy file share has the following path:

\\fileshare\Projects\Monkey Express\PRJ089\Project Management\Project Plan

From this structure we can imply the following metadata:
  • Client = Monkey Express
  • Project Code = PRJ089
  • Document Type = Project Plan

A DMS has methods to import this metadata when the document is moved to the new repository. Once imported, a user can easily find all the project plans in the repository, only the projects plans for Monkey Express, or the specific project plan for project PRJ089 using the DMS search capabilities regardless of where the plan is stored in the DMS repository.
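As a rough illustration of extracting that implied metadata ahead of an import, the following sketch splits the example path above (with the file name appended) into fields. The segment positions are specific to this particular folder layout and would need adjusting for any other structure.
# Split the legacy UNC path into the metadata implied by its folder levels
$path  = '\\fileshare\Projects\Monkey Express\PRJ089\Project Management\Project Plan\plan.xlsx'
$parts = $path.Trim( '\' ) -split '\\'
[PSCustomObject]@{
    Client       = $parts[2]   # Monkey Express
    ProjectCode  = $parts[3]   # PRJ089
    DocumentType = $parts[5]   # Project Plan
    FileName     = $parts[-1]  # plan.xlsx
}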

Sunday, July 1, 2012

Shoot Me Now!

It has been a bad week for things breaking. I found out my Nikon 70-200AFS f/2.8 lens, which stopped focusing, was not going to be repaired under warranty ($600 plus tax); the air conditioner stopped working a week after getting repaired; something got stuck in the central vacuum pipes in the wall and it lost suction; and, my photo edit suite workstation died.

Despite feeling the weight of an old Vista install, the edit suite had been running fairly well after I solved an intermittent reboot problem about a year ago. Then my wife reported a reboot while she was working the other day. I did not think too much of it until I saw it stuck on the RAID controller BIOS page the following day; one of my striped arrays had a drive with the word "ERROR" next to it. As it happened it was the OS drive, so I was completely dead.

After playing with the computer for a while it seemed clear the error was not a one off situation, so I went looking for the Western Digital drive diagnostics (for my Raptor drives) to be sure what was working and what was not. I needed a DOS bootable USB stick to run the thing, which turned out to be an annoying problem to create from my other Vista computer. I eventually got it sorted out and determined that one drive was dead (time to see how the 5 year WD warranty works) and the others were fine.

I decided a quick fix would be to simply lose the broken RAID array and go with a single drive for the time being. I went to grab my Vista install disk only to be reminded that I had originally purchased a 32 bit OEM disk and downloaded the 64 bit version, which I then installed. Not only could I not find the backup DVD I created for the downloaded version, but I could not find anywhere at Microsoft where I could download (no MSDN subscription) a new copy. Since I was having to go through the pain of a clean install anyway, I thought I would say goodbye to Vista and hello to Windows 7. Off to buy a new license.

Up to this point, my elapsed time on this whole project is about 3 days. Mostly because I have had other things to do, but there has also been a fair bit of time sitting in front of a screen waiting and/or scratching my head. I was pretty happy when I could finally say I had all my ducks in a row and could begin my clean Windows 7 install. After a fairly quick boot from the install DVD, I was surprised to find that it seemed to know all about my Intel(R) Matrix Storage Manager RAID controller as all my drives appeared on the install menu. That saved me finding and loading the drivers. My pleased look was soon wiped away as a new problem seemed to be forming.

Two of the four partitions displayed had a note that said they were not compatible with being a Windows system drive. This was okay as this note was not present on the one I really wanted to use. However, when I tried to go to the next install step I got an error: Setup was unable to create a new system partition or locate an existing system partition. It seemed simple enough at first until I realized that no deleting, formatting, or creating new partition from the menu would help me.

A quick search on Google demonstrated that I was not alone, but nothing suggested seemed to help. I thought there may have been an issue left over from the RAID controller. I used the WD diags to clear the drive. SHIFT-F10 brought me to a command line where I could use DISKPART to try and clean things up, but no help there either. There were recommendations to disconnect all other drives, but that did not seem to help. I installed a drive I had not been using thinking there may still be some residual RAID "effect", but no.

Many hours and many more reboots had passed. I was just about to dump the SATA ports on my ASUS motherboard and find an old IDE drive to try when I thought I would give Google one more shot. I dug a little deeper through the search this time and found more of the same stuff, but one very short post solved my problem. It repeated an old refrain, "unplug all your other drives", with one minor difference. In parenthesis it said "external" and "flash". I looked down and saw that my USB stick was still installed from days before when I was still testing the failed drive. I pulled it out, clicked NEXT and nearly fell off my chair when the Windows 7 install continued! I guess a drive is a drive is a drive. (I did not need to unplug the DVD drive with the install DVD in it though, so maybe all drives are not truly equal). Doh!

As a side note, I found a great piece of software during this ordeal. TestDisk is used to repair problem drive partitions among other things. I had to break a RAID 0 array in the course of things and TestDisk repaired it perfectly. I would tell you about how I solved the making-a-bootable-DOS-USB-stick-under-Vista problem, but it was more fluke than process and I had no energy to go back and recreate it, so I will be labeling and keeping my DOS stick intact.