[Research] iKnow and algorithms.
Hello!
My group and I are currently doing a research project on natural language processing, and iKnow plays a big role in it. I am aware that the algorithms iKnow uses aren't public, and I respect that.
My question is: are there any public documents or research papers that explain, at least in part, the algorithms iKnow uses and the motivations for using them?
Here is a concrete example: we are using GetSimilar() for many of our results and it works very well. As the documentation states, we can choose to look at entities, CRCs, or CCs, and choose between the algorithms SIMSRCSIMPLE and SIMSRCDOMENTS.
- Is there any additional information about SIMSRCSIMPLE and SIMSRCDOMENTS beyond the GetSimilar documentation?
- How are partial matches handled? E.g., with the CRCs "He is happy" and "She is happy", will they get partial credit or none?
I apologize if this is the wrong place to ask but I've gotten so much great feedback here before so I thought it was worth a try.
Thanks!
You can open this (or any) method in Studio and see its definition (with some rare exceptions: in the iKnow package, only the %iKnow.TextTransformation.HeaderRepositorySetArray and %iKnow.TextTransformation.KeyRepositorySetArray classes are not available). It's the best way to get an idea of how a method works, and the code usually even has comments.
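For instance, besides opening the class in Studio, you can also print a method's source programmatically through the standard %Dictionary API; a minimal sketch, using the GetSimilar() query discussed in this thread:

    // open the method definition by its "ClassName||MethodName" ID
    Set method = ##class(%Dictionary.MethodDefinition).%OpenId("%iKnow.Queries.SourceAPI||GetSimilar")
    If $IsObject(method) {
        // Implementation is a character stream holding the COS body
        Do method.Implementation.OutputToDevice()
    }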
Scraped from GetSimilar():
Hi Benjamin,
The (patented) magic of iKnow is the way it identifies concepts in sentences, and that happens in a library shipped as a binary, which we refer to as the iKnow engine and which is used by both the iKnow APIs and iFind indices. Most of what happens with that engine's output is not nearly as much rocket science and, as Eduard indicated, its COS source code can usually be consulted for clues on how it works if you're adventurous.
Both options of the GetSimilar() query work the same way: they take the top concepts of the reference source and look for other sources that contain them as well, weighting in-source relevance by frequency (SIMSRCSIMPLE) or by dominance (SIMSRCDOMENTS). So not much rocket science, and only full matches are supported at this point, which also answers your CRC question: "He is happy" and "She is happy" would not score as a partial match.
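To make that concrete, here is a sketch of calling both options; the parameter order (page, page size, filter, skip lists, algorithm constant) reflects the class reference as I recall it, so double-check it against your version:

    #include %IKPublic
    Set domainId = 1, srcId = 123  // hypothetical domain and source IDs
    // frequency-weighted similarity:
    Do ##class(%iKnow.Queries.SourceAPI).GetSimilar(.result, domainId, srcId, 1, 10, , , $$$SIMSRCSIMPLE)
    // ...or dominance-weighted (overwrites result in this sketch):
    Do ##class(%iKnow.Queries.SourceAPI).GetSimilar(.result, domainId, srcId, 1, 10, , , $$$SIMSRCDOMENTS)
    Set i = ""
    For {
        Set i = $Order(result(i), 1, row)  Quit:i=""
        Write $ListToString(row), !  // each row is a $list describing a similar source
    }

The $$$SIMSRCSIMPLE and $$$SIMSRCDOMENTS macros come from the %IKPublic include file.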
That said, iKnow offers you the building blocks to build much more advanced things, quite possibly inspired by your academic research, leveraging the concept level that is unique to iKnow in identifying what a text is really about. For example, you can build vectors containing entity frequency or dominance and look for cosine similarity in this vector space, or you can leverage topic modelling. Many of these approaches require quite a bit of computation, though, and actual result quality may depend a bit on the nature of the texts you're dealing with, which is why we chose to stick to very simple things in the kit for now.
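To illustrate the cosine idea, here is a minimal sketch. It assumes you have already collected per-source entity frequencies into sparse local arrays of the form vec(entityId) = frequency (e.g. via the entity query APIs; that part is left out), and the Cosine method name is made up for this example:

    ClassMethod Cosine(ByRef a, ByRef b) As %Double
    {
        // cosine similarity of two sparse vectors: dot(a,b) / (|a|*|b|)
        Set (dot, normA, normB) = 0
        Set id = ""
        For {
            Set id = $Order(a(id), 1, fa)  Quit:id=""
            Set normA = normA + (fa * fa)
            // only entities present in both vectors contribute to the dot product
            If $Data(b(id), fb) # 2 { Set dot = dot + (fa * fb) }
        }
        Set id = ""
        For {
            Set id = $Order(b(id), 1, fb)  Quit:id=""
            Set normB = normB + (fb * fb)
        }
        Quit:(normA=0)||(normB=0) 0
        Quit dot / ($ZSQR(normA) * $ZSQR(normB))
    }

With two such vectors in hand, Cosine(.vecA, .vecB) returns a score between 0 (no shared entities) and 1 (identical relative frequencies).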
However, you can find two (slightly) more advanced options in demos we have published online:
In both cases, there are myriad options to refine these algorithms, but all come at a certain compute cost, given the high dimensionality introduced by the iKnow entity (and actually even word) level. If you have further ideas or, better yet, sample code to achieve better similar-document lists, we'd be thrilled to read about it here on the community ;o)
Thanks,
benjamin