
New Year, New Plans

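  Every time a break comes around, I find my blog nearly abandoned and overgrown with weeds. This time was especially bad: I had just moved to WordPress, which gave the endless stream of foreign comment spammers an opening, and having just started graduate school, the brutal coursework left me no time for weeding. After two days of work I have finally upgraded, updated, and hardened the blog. I hope it holds; if it gets breached again, I won't know what to do...

  OK, a quick recap of recent events. The main task for this trip home was recuperating, so I mostly stayed indoors, and apart from messing around online I did nothing meaningful, haha. I did go skiing once with Yaya, which was quite nice~

  End of recap.

  A couple of days ago, while browsing around, I saw a fellow blogger post a fairly strict New Year plan and liked the idea. So here is mine: the plan for the coming semester, starting March 1, posted so that fellow bloggers and Yaya can hold me to it~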

I. Main Work

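  1. Read at least 20 papers or other materials on IR before my birthday, to take my academic level up a small step; call it a small birthday present to myself.
  2. Take on responsibility for the lab Wiki, and step up to own the language model (LM) content.
  3. Use spare time to update the lab homepage before midterms, which also settles a long-standing debt to my advisor~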

II. Side Work

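  1. Keep updating and maintaining the blog and related spaces: at least twice a week, maintain the blog, reply to comments, and contact fellow bloggers. No spam, no weeds~
  2. Set .NET aside for now and study C++ intensively for half a year.
  3. Study Lemur and Indri, and post 5 IR-related entries.
  4. Keep getting familiar with Windows 7 through trial and error, and post 5 PC-related entries.
  5. Thanks to earlier efforts, money is not a worry for now: stop taking on new paid jobs, gradually clear the work on hand, and focus on building up knowledge.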

III. Leisure

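  1. Keep following "Criminal Minds", "Lie to Me", "Naruto", and "Bleach" with Yaya, watch 10 movies, and post reviews~ (Apparently the next Harry Potter film is due in the summer season; looking forward to it)
  2. Keep improving my badminton singles and doubles, aiming for good results in upcoming matches. In singles, aim to catch up with Kan Ge, in preparation for overtaking him later.
  3. Stay in closer touch with friends, and try to return from "Mars" (being out of the loop) soon...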

By Kevin


First Look at Lemur

Introduction:

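Lemur is a system jointly released by CMU and UMass for research on language modeling and information retrieval. With it you can implement ad hoc or distributed retrieval based on language models, the traditional vector space model, and Okapi, and you can use structured queries, cross-language retrieval, filtering, clustering, and more. The latest version is currently 3.0; in September, CMU and UMass will release a new version, Indri (named after the indri, a large lemur), which adds support for terabyte-scale collections (1 TB is 1000 GB) and structured document queries (for example, parsing HTML documents into different document representations using structural information such as tags, title, and meta).

What does Lemur need to run? Lemur works on Windows or Unix, so it can be used directly on Windows. However, Lemur ships shell scripts that demonstrate the complete retrieval workflow, so on Windows you need to install Cygwin to emulate a Unix environment. Lemur also provides a GUI program and CGI scripts for user interaction; the GUI includes Java programs that display retrieval results directly, so a Java virtual machine is required, and the CGI programs require a Perl interpreter.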
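To make "retrieval based on language models" a little more concrete, here is a minimal, self-contained sketch of query-likelihood scoring with Dirichlet smoothing, one common textbook formulation. It does not use Lemur's API and is not taken from Lemur's source; the names `TermCounts`, `countTerms`, and `queryLogLikelihood` are made up for illustration, and skipping query terms unseen in the collection is a simplifying assumption.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Term-frequency table: term -> number of occurrences. (Hypothetical helper type.)
using TermCounts = std::map<std::string, int>;

// Count how often each term occurs in an already tokenized text.
TermCounts countTerms(const std::vector<std::string>& tokens) {
    TermCounts counts;
    for (const auto& t : tokens) ++counts[t];
    return counts;
}

// log P(query | document) under a unigram language model with Dirichlet smoothing:
//   P(w | d) = (tf(w, d) + mu * P(w | C)) / (|d| + mu)
// where P(w | C) is the term's relative frequency in the whole collection.
// Query terms that never occur in the collection are skipped (illustrative choice).
double queryLogLikelihood(const std::vector<std::string>& query,
                          const TermCounts& doc, int docLen,
                          const TermCounts& collection, long collectionLen,
                          double mu) {
    assert(mu > 0.0 && collectionLen > 0);
    double score = 0.0;
    for (const auto& w : query) {
        auto c = collection.find(w);
        if (c == collection.end()) continue;  // unseen everywhere: skip
        double pColl = static_cast<double>(c->second) / collectionLen;
        auto d = doc.find(w);
        double tf = (d == doc.end()) ? 0.0 : d->second;
        score += std::log((tf + mu * pColl) / (docLen + mu));
    }
    return score;
}
```

Ranking then amounts to computing this score for every document and sorting in descending order; a document that contains more of the query terms gets a higher (less negative) score. The smoothing parameter `mu` is a tuning knob, and any value used when trying this out is an arbitrary choice.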

Download: http://www.lemurproject.org/

Click "lemur" in the left sidebar to see releases from 4.3 through the latest version:

File                      Description          Platform
Indri-2.10-install.exe    Indri installer      i386
Indri-2.10.tar.gz         Source               Platform-Independent
lemur-4.10.dmg            Mac OS package       i386
lemur-4.10-doc.tar.gz     API documentation    Platform-Independent
lemur-4.10-install.exe    Lemur installer      i386
lemur-4.10.tar.gz         Source               Platform-Independent

Download lemur-4.10-install.exe and run it to install.

Directory layout: ..\Lemur 4.10\

bin\ Lemur Toolkit applications: executables that can be invoked directly from the command line; see windoc\lemur-applications.html

include\The lemur include files

lib\the lemur library

windoc\Overview of the Lemur Toolkit

src_vs_2005\ Complete Lemur Toolkit source for the Microsoft platform

javadoc\java API document

GUI\

RetUI.jar provides a basic document retrieval GUI for interactive queries, using the Indri API.

IndexUI.jar provides a basic collection indexing GUI for building an indri repository.

LemurRet.jar provides a basic document retrieval GUI for interactive queries using the Lemur API.

LemurIndex.jar provides a basic collection indexing GUI for building Lemur indexes.

lemur.jar and indri.jar for the Lemur and Indri APIs.

doc\ Lemur Toolkit Documentation, for example:

Namespace List | Class Hierarchy | Alphabetical List | Class List | Directories | File List | Namespace Members | Class Members | File Members | Related Pages

CSharp\The C# wrapper classes assembly will be in LemurCsharp.dll This assembly should be referenced by your C# program.

使用方式:

(1)直接拿lemur的程序来使用,即bin\下的可执行程序;

(2)Building applications using Visual Studio .NET即直接在自己的项目中调用Lemur库等;

After installing the lemur toolkit, you can use the library by adding the subfolder include of the target directory to the “C/C++ / General / Additional Include Directories” property for your project.

Next, add the subfolder lib of the target directory to the “Linker / General / Additional Library Directories” property for your project.

Next, add lemur.lib and wsock32.lib to the “Linker / Input / Additional Dependencies” property for your project.

Also, if your project is configured as “Debug”, you should choose the “Multi-threaded Debug DLL(/MDd)” runtime library. If your project is configured as “Release”, you should choose the “Multi-threaded DLL(/MD)” runtime library. The installable Lemur Library and applications were built in Release / Multi-Threaded mode.

Finally, you should have C/C++ Language Enable Run-Time Type Info set to yes.

(3) Compiling the Lemur Toolkit with Visual Studio .NET, i.e. modifying Lemur to fit your own needs, then recompiling and calling it;

The installer can optionally install the full Lemur Toolkit source tree, placing it in the “src_vs_2003” subfolder and/or the “src_vs_2005” subfolder of the target directory, depending on which version(s) of Visual Studio you have installed. That folder contains the Visual Studio solution file “Lemur.sln”. There is a separate project file for each library and for each application in Lemur.

By default the project configurations are built in “Debug” mode. To change this so that it compiles with fewer warnings and runs at higher efficiency, change the configuration setting in the “Build” menu. Then choose “Configuration Manager”. In the menu for “Active Solution Configuration”, choose “Release”.

When built from source, there is a separate library for each of the sub-libraries that are compiled into “lemur.lib”. The combined library, “lemur.lib”, is built in the lemur subfolder, with output in either Release or Debug, depending on configuration.

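Important Notes:

1. Before compiling the toolkit from the source, you must set the proper include path for the Java library. To modify the library, in the Solution Explorer view, right-click on the “lemur_jni” project and choose “Properties”. Set the “Configuration” drop-down box (at the top of the dialog box) to “All Configurations”. Next, in the “Additional Include Directories” field, set the appropriate paths to your Java JDK installation’s include directory and include/win32 directory. Press the “OK” button when finished, and rebuild. [If the file 'jni.h' still cannot be found, add the JDK's include and win32 directories to Additional Include Directories as separate entries.]

2. To avoid errors such as "error PRJ0008: could not delete file 'e:\lemur 4.8\src_vs_2005\app\obj\vc80.pdb'", or files that cannot be opened, set the maximum number of parallel project builds to 1. This is a parallel project builds issue (possibly specific to CPUs with two or more cores?).

Notes:

Lemur includes support for Arabic, which can cause character encoding problems on a Chinese-language system, so the modules that handle Arabic need to be disabled. Find the file Arabic_Stemmer.cpp in the parsing module and blank out the contents of its functions. For functions with a void return type, comment out the entire function body; for functions with a return type, comment out the whole function. Note that the module itself must not be deleted, because other modules call its interfaces; if those interfaces are removed, the program will fail to compile.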
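As an illustration of "blank out the body but keep the interface", here is a minimal sketch using a made-up class. `DemoStemmer`, `stemInPlace`, `normalize`, and the commented-out helper calls are hypothetical stand-ins, not the actual contents of Arabic_Stemmer.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a stemmer module; NOT Lemur's actual class.
class DemoStemmer {
public:
    // The declaration stays, so callers in other modules still compile and link.
    void stemInPlace(std::string& word);
};

// The original body is commented out; the function is now a no-op.
void DemoStemmer::stemInPlace(std::string& word) {
    // word = removePrefix(word);   // disabled (hypothetical original logic)
    // word = removeSuffix(word);   // disabled (hypothetical original logic)
    (void)word;  // silence the unused-parameter warning
}

// A caller in "another module" that depends on the interface.
std::string normalize(std::string word) {
    DemoStemmer stemmer;
    stemmer.stemInPlace(word);  // still resolves, but leaves the word unchanged
    return word;
}

// Quick check: with the body disabled, words pass through untouched.
void checkStubIsNoOp() {
    assert(normalize("kitab") == "kitab");
}
```

Because only the body changes, the rest of the code base builds without modification; the trade-off is that stemming silently does nothing for that language.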

Reference documentation:

Lemur Toolkit and Indri Search Engine Documentation

http://www.lemurproject.org/docs/index.php/Main_Page

PS:

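Whatever you do, do not use lemur v4.10.1 or indri v2.10.1. In my own tests their indexing had serious problems (which cost me an entire afternoon), cause unknown. The x.10 versions work fine.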

By Kevin Ma
