File Naming, GUIDs, Duplication, Identity and Metadata: A Response to John O’Gorman

I love social network conversations. There has been a great one going on over on LinkedIn. In the AIIM Group for Intelligent Information Management, a conversation was started around whether or not file naming conventions are needed when we have robust EDMSs (enterprise document management systems). John O’Gorman, an Information Integration specialist, made a provocative post (in the spirit of great dialog) and I responded. The answers and debate have grown and now, rather than take up the whole forum, I am posting my reply here, so that you may participate as well!

If you have not read the thread, you can do so HERE (if you want to skip to the Billy vs. John debate, go to page 3). Without further ado, here is my reply:

I appreciate the engagement and invite others into the fray. I think this makes us all sharper! So in the spirit of mutual enlightenment and the disputational interrogative we engage!

We start off on common ground, agreeing that humans are *much* better suited to pattern recognition and discrimination than programs are. While programs can process vastly larger quantities of information, we can identify and consume “relevant” information more efficiently (at least for now).

1) You mention that a computer cannot have even 2 files with the same name in a folder while we can pick out one from many quite easily. I agree with the example you give, but the question was not answered and your answer imports some assumptions that aren’t necessarily so. Let me explain. I would argue (based on my pop-sci understanding) that human discrimination is facilitated by the way our brains “tag” memories with unique identifiers. We use electro-chemical “naming conventions”; programs use other conventions. Same fundamental strategy, though. To this extent, the argument that the strategy computer programs use is bad because it differs from what human brains use fails. Showing a difference in result does not impugn the process, merely the efficiency or execution or a host of other limiting factors. Secondly, your example imports some assumptions that do not hold. Why do you assume the Windows file system uniqueness requirements? I work with EDMSs that can store any number of files with the same file name in the same “location”, display those in a single collection and slap a “folder” icon on top of it. Computer programs? Yep. Windows file system limits? Nope.
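To make the point about same-named files concrete, here is a minimal Python sketch. It is my own illustration under stated assumptions, not the API of any particular EDMS: the names `StoredItem`, `Folder`, `add` and `find_by_name` are hypothetical. The only idea it demonstrates is that once the key is a generated ID, the file name becomes an ordinary attribute that may repeat.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class StoredItem:
    """One stored file. The file name is an attribute, not the key."""
    file_name: str
    content: bytes
    # Hypothetical ID scheme: a random UUID assigned at creation time.
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Folder:
    """Hypothetical EDMS 'folder': a collection keyed by item ID."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._items: dict[str, StoredItem] = {}

    def add(self, file_name: str, content: bytes) -> StoredItem:
        item = StoredItem(file_name, content)
        self._items[item.item_id] = item  # uniqueness lives in the ID
        return item

    def find_by_name(self, file_name: str) -> list[StoredItem]:
        # A name lookup can legitimately return several items.
        return [i for i in self._items.values() if i.file_name == file_name]


folder = Folder("Project Docs")
first = folder.add("readme.txt", b"install notes")
second = folder.add("readme.txt", b"release notes")
matches = folder.find_by_name("readme.txt")
print(len(matches), first.item_id != second.item_id)
```

Both `readme.txt` items coexist in the same "location"; telling them apart in a listing is then a display and metadata question rather than a storage constraint.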

2) Maybe I didn’t understand when you originally stated, “the reason we put meaningful labels on anything is because no one has come up with a better alternative”. This sounded to me like you considered this the worst of no possible alternatives. I am not sure why you feel this way. As you say, it seems to work pretty well for our brains and while search isn’t up to our levels yet, it is getting there. I would also argue that basic keyword indexing (tokenization) has limitations. But aren’t you creating a straw man argument here? Why must disambiguation happen at this level? Search engines also incorporate (and are increasing their incorporation of) other weighting/prioritization/relevancy axes in order to achieve disambiguation and increase relevancy. This is why entity extraction and ontology-assisted querying are so promising. Before going there, though, disambiguation and prioritization happen through incorporating inbound linking, folksonomies, usage/consumption patterns and other factors. Bibliographic tracing is quite popular in higher education systems in order to figure out which concepts are derivative and which are foundational. Computer programs are doing the tracing and therefore the discrimination here.
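As a rough illustration of what "other relevancy axes" can mean, here is a toy Python ranker. It is a sketch under my own assumptions, not how any real search engine scores results: the function `relevance`, the weights `w_links` and `w_usage`, and the sample documents are all hypothetical. It simply blends a keyword count with inbound links and usage so that two documents with identical keyword matches no longer tie.

```python
import math


def relevance(doc, query_terms, w_links=0.5, w_usage=0.25):
    """Blend keyword hits with inbound-link and usage signals.

    The weights are arbitrary illustration values, not tuned numbers.
    """
    words = doc["text"].lower().split()
    keyword_hits = sum(words.count(term) for term in query_terms)
    if keyword_hits == 0:
        return 0.0  # popularity alone never makes a document relevant
    return (
        keyword_hits
        + w_links * math.log1p(doc["inbound_links"])
        + w_usage * math.log1p(doc["views"])
    )


# Hypothetical corpus: "a" and "b" match the query equally on keywords.
docs = [
    {"name": "a", "text": "salary policy draft", "inbound_links": 0, "views": 2},
    {"name": "b", "text": "salary policy approved", "inbound_links": 12, "views": 300},
    {"name": "c", "text": "holiday schedule", "inbound_links": 40, "views": 900},
]

query = ["salary", "policy"]
ranked = sorted(docs, key=lambda d: relevance(d, query), reverse=True)
names = [d["name"] for d in ranked]
print(names)
```

Keyword indexing alone cannot separate "a" from "b"; the link and usage signals do, while the heavily used but irrelevant "c" still scores zero.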

3) I grant that, as you say, two different GUIDs only assert that the two resources with which they are associated are deemed different. But again you brought up the distinction between systems and humans. Pictures of (the same) JaneSmith at age 3 and at age 33 are very different, and how different *depends on your prior relationship with her*. The difference may be enough to prevent you from identifying her as the same person or (if you are her father) not enough to confuse you even for a moment. This raises two very important questions around meaningful (aka relevant) difference and identity. With a person we presume continuity of something that transcends cellular existence (personality, soul, spirit, whatever). With information, what do we have? Changing a single byte in a document between versions alters checksum values, thereby creating an entirely new item (by one interpretation). But that one byte is not likely a meaningful difference and so we are comfortable with maintaining the ID of the item. So hyper content-centric identity of information seems not to be useful. Alternatively, I can create an empty “unique id” for an item in my EDMS and then proceed to associate that ID with an image of JaneSmith, then a resume for JaneSmith, then a video of the Space Shuttle. That unique ID retains its uniqueness among all other items in the system but is a container for “nothing – image – document – video”. The ID is still able to be distinguished from all others in the system. The difference is the intentionality with which it is assigned. Humans import definition by creating the ID and therefore create the uniqueness regardless of what “thing” the identifier points to. This is extrinsic identification rather than intrinsic identification. I think there is a place for both kinds of identity in our world, but EDMSs generally focus on and utilize extrinsic identity.
This is because we can import purpose more easily to extrinsically identified objects (since we control them) than to intrinsically identified objects (which we have to find a use for).
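The contrast between the two kinds of identity can be sketched in a few lines of Python. This is an illustration only: `intrinsic_id` and `Slot` are hypothetical names of mine, SHA-256 stands in for whatever checksum a system might use, and the byte strings are placeholders for real files.

```python
import hashlib
import uuid


def intrinsic_id(content: bytes) -> str:
    """Identity derived from the bytes themselves (a SHA-256 checksum)."""
    return hashlib.sha256(content).hexdigest()


class Slot:
    """Hypothetical extrinsically identified item.

    The ID is assigned once, by an act of intention, and survives
    whatever content is later attached to it.
    """

    def __init__(self) -> None:
        self.slot_id = str(uuid.uuid4())
        self.content = None  # the "empty" unique ID

    def attach(self, content: bytes) -> None:
        self.content = content


# Intrinsic identity: one extra byte yields an entirely different ID.
v1 = b"Jane Smith resume, draft"
v2 = b"Jane Smith resume, draft."
ids_differ = intrinsic_id(v1) != intrinsic_id(v2)

# Extrinsic identity: the ID stays put through image, document, video.
slot = Slot()
assigned = slot.slot_id
for payload in (b"<image bytes>", b"<resume bytes>", b"<video bytes>"):
    slot.attach(payload)
id_stable = slot.slot_id == assigned

print(ids_differ, id_stable)
```

The checksum is faithful to the content and blind to meaning; the assigned ID is indifferent to the content and carries whatever meaning the person who created it intended.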

4) I agree whole-heartedly that, as you say, “There is some philosophy in every human endeavor, just as there are some mathematics in every computer”. I, like you, enjoy engaging on this level as well! So props to us and the readers who enjoy it. I think this is an important area where we can elevate the discussion and practice in our community. So I’ll bite again. While I love talking about how interior angles of a triangle can add up to 270 when laid out on a sphere, I’m not seeing the information science analogy, but I am excited to hear it! In the meantime, I do not follow your LA example. You say that “without being told ‘Los Angeles’ (GUID h34dh23a4b7b33c8361) is different than ‘Los Angeles’ (GUID 7b33c8361h34dh23a4b) a computer is ignorant of that difference.” But I would reply that the *fact* of the different GUIDs is the discriminating factor for a computer. So by definition a computer *must* “know” that title attribute “Los Angeles” of GUID 1 is different and distinct from title attribute “Los Angeles” of GUID 2. The trouble comes up in how those GUIDs were assigned. If extrinsically assigned (i.e. through an act of intention) then we, the human consumers of that information, assume/rely on the idea that the difference is meaningful. If intrinsically assigned (e.g. automatically via a crawler or something else) then we get into potential duplication / overlap / synonym problems (e.g. the checksums were different but the objects were not meaningfully different). The trouble I have with this is that meaning is always imported by humans and is dependent on the scope of our problem domain. Google Earth can show the globe or my back yard. At one scope of the problem domain (e.g. where do you live?) both the “globe” and “my back yard” answers are correct. To a space alien the globe provides the appropriate level of meaning. To my mother the back yard provides the appropriate level of meaning.
So by setting up the question the way you have, you seem to be assuming a common scope which is only ever extrinsically identified and therefore unable to be held in common.

5) I think the “assigned to maintain” vs “derived to describe” strategy is very good and the best part is that they are not mutually exclusive. I agree with you that these are “an interesting twist on randomly assigned object identifiers”. But these are quite common. They are simply at different levels of the problem domain. OWL ontologies, most EDMSs and other relationship maps do this all the time. Using your example, ‘Management Salary Policy’ and ‘Management Gehalts Politik’ and ‘Política del Sueldo de la Gerencia’ and ‘???????? ???????? ??????????’ would share a common identifier that acts as a meta-identifier. The reason for this is that (I assume) you are using a localization example where we have 4 different translations of the same policy. In this case we humans understand that the collection, the set, is a common set and should be related and identified with a common identifier. The difference at the set level is not meaningful. At a deeper level the language difference becomes meaningful only after the set has been identified and located. Here is where we start delving into Derridean concepts of différance and what is meant by identity. Suffice it to say that this is the realm of extrinsically assigned (e.g. derived to describe) identification.
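A shared meta-identifier of this kind is easy to sketch in Python. The field names (`doc_id`, `set_id`, `lang`), the helper `translations_of` and the unrelated "Travel Policy" record are hypothetical choices of mine for illustration; the three titles come from the example above.

```python
import uuid


def new_id() -> str:
    """Generate a random identifier (hypothetical ID scheme)."""
    return str(uuid.uuid4())


# One identifier per *set*, assigned by a person who knows the
# documents are translations of the same policy.
salary_policy_set = new_id()
travel_policy_set = new_id()  # an unrelated set, for contrast

documents = [
    {"doc_id": new_id(), "set_id": salary_policy_set, "lang": "en",
     "title": "Management Salary Policy"},
    {"doc_id": new_id(), "set_id": salary_policy_set, "lang": "de",
     "title": "Management Gehalts Politik"},
    {"doc_id": new_id(), "set_id": salary_policy_set, "lang": "es",
     "title": "Política del Sueldo de la Gerencia"},
    {"doc_id": new_id(), "set_id": travel_policy_set, "lang": "en",
     "title": "Travel Policy"},
]


def translations_of(set_id, docs):
    """First locate the set, then let language discriminate within it."""
    return {d["lang"]: d["title"] for d in docs if d["set_id"] == set_id}


found = translations_of(salary_policy_set, documents)
spanish_title = found["es"]
print(sorted(found), spanish_title)
```

Every document keeps its own `doc_id`, so nothing is lost at the item level; the `set_id` only records the human judgment that the language difference is not meaningful until the set has been found.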

6) Here we’re getting down to brass tacks. I agree that search is a big pain point for most organizations. AIIM, Gartner, Forrester, Ovum, Gilbane and others all agree. But most EDMSs don’t care if multiple files with the same name are stored. To be clear, by files with the same name I mean the same “file name” (e.g. BlogPost.doc or MyPresentation_Final.pptx or LinkedInReply.txt). This is because the EDMS will store that file name as an attribute but give the file a GUID. The good EDMSs (like Oracle UCM) will provide a set identifier (ContentID) that identifies the set of revisions, which may be substantially different from each other. Each revision has its own unique identifier (dID). Furthermore, ContentIDs can be auto-generated, auto-derived from rules / extraction processes or manually created each and every time. Additionally, content objects can be associated with each other along any number of axes for intentional purpose-based collecting/discovering/location/identification. These axes may or may not be indexable. This allows information discovery (classification, grouping, categorization) to be brought alongside information location (querying, retrieving) rather than relying simply on one or the other or requiring a serial approach of one then the other. I fundamentally disagree with your statement that “nor is it considered best practice in an EDMS to encourage or even allow contributors to randomly assign their own metadata.” First, no user ever “randomly” assigns their own metadata. At least not in true random form. Second, your statement begs the question of what metadata is and what end users should create and consume. So are Flickr tags metadata? Yes. Should users create them and be enabled to create them? YES! Are star-based ratings on blog posts metadata? YES. Should users be allowed and empowered to engage with rating systems? YES! What about “comments” or “descriptions” or “due date”?
I would argue that in contextually appropriate situations users should always have at least the option, if not the requirement, to add descriptive and intentional metadata to their content objects. The more that entity extraction systems (e.g. OpenCalais, GATE, Clarabridge etc.) become commonplace, the easier we can make it for people. But I do not foresee a time when information creators will be able to, or should, stop describing and classifying what they have created or consumed.
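The set-identifier-plus-revision-identifier idea can be modelled in memory with a short Python sketch. To be clear about assumptions: this is not Oracle UCM's API. `Repository`, `check_in`, `revisions` and `latest` are hypothetical names, the `POLICY-001` style content IDs are made up, and numbering revisions with an incrementing integer is my simplification.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    d_id: int         # unique per revision
    content_id: str   # shared by every revision in the set
    file_name: str    # plain attribute; duplicates are allowed
    body: bytes


class Repository:
    """Hypothetical in-memory stand-in for an EDMS revision store."""

    def __init__(self) -> None:
        self._next_d_id = itertools.count(1)
        self._revisions = []

    def check_in(self, content_id: str, file_name: str, body: bytes) -> Revision:
        rev = Revision(next(self._next_d_id), content_id, file_name, body)
        self._revisions.append(rev)
        return rev

    def revisions(self, content_id: str):
        return [r for r in self._revisions if r.content_id == content_id]

    def latest(self, content_id: str) -> Revision:
        return max(self.revisions(content_id), key=lambda r: r.d_id)


repo = Repository()
repo.check_in("POLICY-001", "BlogPost.doc", b"first draft")
repo.check_in("POLICY-001", "BlogPost.doc", b"substantially rewritten")
repo.check_in("POLICY-002", "BlogPost.doc", b"a different item, same file name")

history = repo.revisions("POLICY-001")
current = repo.latest("POLICY-001")
print(len(history), current.d_id, current.body)
```

Three files named `BlogPost.doc` live side by side: two are revisions of one content item and the third is a separate item, and neither relationship depends on the file name.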

7) You write that “This issue of whether or not to have a convention for naming files is symptomatic of a systemic problem.” I agree wholeheartedly. But I disagree that this means that all systems fail to solve the problem. Indeed, I am very confident that with technologies like the Oracle ECM system and Fishbowl Solutions add-on modules such as our Subscription Notifier, Workflow solution set, CollabPoint and Advanced User Security Mapping, along with our solutions like Contract Management, Policies and Procedures, Admissions Office Onboarding and Research Solutions, we can and do solve the business problems around file naming conventions, information location and efficiency boosting.

So I will echo your sentiment at the end, with which I agree unequivocally: In the spirit of the community of intelligent information management, I’m just sayin’…

11 Responses to “File Naming, GUIDs, Duplication, Identity and Metadata: A Response to John O’Gorman”

  1. Deborah Johnson Says:

    I am all about word sense disambiguation: I think you would enjoy looking at Cognition Technologies.

    One of the biggest problems with every other semantic search technology is their inability to disambiguate words within context. As an example, most won’t understand that “bat in cave” is about a flying furry creature, so they also retrieve “baseball bat”.

    The key differentiators for Cognition’s semantic technologies are (1) their ability to understand word meaning (think of the Semantic Map as a giant dictionary); (2) their ability to disambiguate word meanings within the context of how they are used (the “bat in cave” example); and (3) their understanding of synonymy. There is no other semantic technology available which can do all three of these simultaneously. As a result, CognitionSearch (i.e. the Search application) can simultaneously improve both precision and recall ratios.

    http://library2.lawschool.cornell.edu/insiteasp/public/display_browse.asp?style=st_browse&id=1712&prevpage=3ccls

  2. John O'Gorman Says:

    Hello again;

    Sorry for the long cycles here, Billy – my son had his eighteenth birthday this weekend and last week was a gong show, but for entirely different reasons. I will try to tighten up the response times…

    I was going to go through your responses (1-7) individually, but since we seem to be on fairly level ground, I thought a few general observations might be more helpful.

    The naming conventions issue is one where the demands of the digital reality meet the needs of human control. I still can’t get my Windows Explorer to accept two readme.txt files in the same folder, and although Oracle or Documentum can programmatically put two identical files in the same ‘location’, it makes for some pretty confusing search results without more programming.

    Most people can visually discriminate two things – it’s a survival imperative – and so lacking a graphic representation of a piece of unstructured content, we tend to want to label them in a way that is recognizable to us. A computer has the opposite problem: unless programmed to make two identical things ‘different’ it will treat ‘readme.txt’ (or any two matching literals) as duplicates.

    Indexing, and by extension search is a good example of where the disconnect between humans and machines occurs on a regular basis.

    To sum up, I would say that the time is quickly approaching where there is so much information out there that the current digital constraints will have to change without programming for individual instances. In other words, we need to adapt our current thinking to accommodate an infinite variety of data inside a more abstract and therefore compact model that reflects the way humans acquire, manipulate and represent information.

    Great conversation, I hope we can continue.

  3. John O'Gorman Says:

    Hello again;

    Now that the ‘party’ is behind me, I want to revisit the last post on your blog. (My son turned 18 this past week, and he somehow wangled permission to throw a keg party for 40 or so of his friends. I didn’t think I would be, but I was a tad stressed about it, so forgive me if my response to your invitation was slow or my response to your blog appeared to give short shrift to your excellent work.)

    1) I used the photos example to demonstrate a point you touch on later in your response. The identity of the subject in the photo is a thing separate from the identity we assign to the digital asset itself. However, if the name of the digital asset uses the ‘name’ (of the person) instead of a unique identifier assigned to that person, there is a problem. I understand that an EDMS can twist this a bit to make it appear that N copies of the same file can exist in a virtual folder, but that causes cognitive overhead of a different kind.

    2) I guess what I was trying to get at here was the reason file naming is so difficult to get rid of. In the picture example, I want to know who the subject of the picture is without opening it, so I give it a bunch of strings in the file name because right now a file name is the most visible point of access. Same with content. The common thread through all my arguments is that indexing content only goes so far, and entity extraction is a great example of how much effort goes into inferring identity to overcome indexing inadequacies.

    3) A couple of things here. If I have a unique identifier for JaneSmith from point 1), above, then a temporally segregated series of pictures of JaneSmith is easy to manage. In fact the rules are the same: 34 different pictures of 1 Jane Smith are handled no differently than 34 different pictures of 34 different Jane Smiths. Think about how OLAP is done and you’ll see what I mean. Your example of one character in a document makes the same point: I have a new digital asset with exactly the same meaning – or not.

    4) I read a book a number of years ago (and have re-read it a number of times since) that talks about man’s progression through various levels of complex mathematics, particularly the solutions to five different types of equations. Linear, quadratic, cubic and quartic equations all have general solutions that arose from the use of more equations. The quintic, however, does not. As the book title suggests, it was ‘The Equation That Couldn’t be Solved’. The book is about symmetry and how the new geometry by Riemann and Klein (I’ll have to check my accuracy here) was used to prove that equations to the power of five do not have a general equation-based solution. I may not get this next part exactly right, but the elliptical geometry was instrumental in supporting Einstein’s idea of warped spacetime, and a number of other radical ideas. What I was trying to suggest is that computers currently use a relatively ‘flat’ geometry, and that a new one is needed to take us to the next level.
    I think you make the same point with the GUIDs example as I did, but in a different way. By extrinsically assigning two different identifiers I am explicitly telling an application that although the strings are identical, their meaning is not.

    5) The localization example is similar to the photo case. Assuming that one document is the key and the others are translated directly from it, the subject in all of the documents is the same, even while the content is guaranteed to be different. An interesting variant of this is that the content can have the same title even though the contents are locale-specific; in other words, the majority of the document is a direct translation while certain sections are original.

    6) Agreed on most of the metadata points, and I like what you’re saying on Oracle UCM, but most otherwise competent analysts and users have no idea what constitutes a good sample of metadata. For example, ask a content analyst whether their metadata refers to the digital asset or the subject, and he’ll look at you like you were crazy. Most don’t make that separation. Entity extraction is another area where current practice is a bit weird. Why, for example, would software go looking to extract entities from a collection of content when they can find all of the entities a company will likely ever use in their databases?

    7) Going back to the symmetry points, yes there are systems that solve the problem, but they are different iterations of the same type of thinking. That thinking is, in my opinion, a direct result of the way computers process strings.
