제스트's Blog :: Lucene Basics


Development/Lucene,ElasticSearch 2015. 10. 30. 15:56

Lucene Basics - Index

In this post, let's look at indexing.

Indexing works like this: given a sentence such as "I am a boy.", a particular analyzer breaks it up, and the resulting terms are what get indexed. Unlike an RDB, Lucene builds its index incrementally.
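To make the idea of "terms being indexed" concrete, here is a minimal plain-Java sketch of an inverted index that grows one document at a time. The class name ToyInvertedIndex and the whitespace/lowercase tokenization are assumptions made for illustration; this is not how Lucene lays out its index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index: maps each term to the IDs of the documents containing it.
// Illustration only (hypothetical class), not Lucene's on-disk format.
class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new LinkedHashMap<>();
    private int nextDocId = 0;

    // Adds one document incrementally; postings of earlier documents are untouched.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Returns the IDs of the documents containing the term (empty if none).
    List<Integer> search(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("I am a boy");
        index.addDocument("You are a girl");
        System.out.println(index.search("a"));   // [0, 1]
        System.out.println(index.search("boy")); // [0]
    }
}
```

Adding the second document only appends to the postings; nothing that was already indexed has to be rebuilt, which is the gist of incremental indexing.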

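With WhitespaceAnalyzer, the sentence above is split on whitespace only, giving I, am, a, boy. (case is kept, and the period stays attached to the last token).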

StandardAnalyzer gives roughly i, am, boy: the tokens are lowercased and "a" is dropped, because the analyzer has a built-in stop word list.

In the end, StandardAnalyzer is made up of roughly StandardTokenizer + StandardFilter + LowerCaseFilter + StopFilter.
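The difference between splitting on whitespace only and running a tokenize, lowercase, stop-word chain can be sketched in plain Java. The class ToyAnalyzers and its short stop list are assumptions for illustration; they only mimic the behaviour and are not Lucene's analyzer classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzers in plain Java (hypothetical class, not Lucene code).
class ToyAnalyzers {
    // Illustrative stop list (assumption): a small subset, not Lucene's full set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to"));

    // Splits on whitespace only: case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Tokenizes on non-alphanumeric characters, lowercases, then drops stop words.
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) tokens.add(lower);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespace("I am a boy."));   // [I, am, a, boy.]
        System.out.println(standardLike("I am a boy.")); // [i, am, boy]
    }
}
```

Which terms survive depends entirely on the stop list, so the same sentence can produce different index terms under different analyzers.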

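// Imports assume Lucene 3.6, JUnit 4, Spring Test, SLF4J, Tika and json-simple.
// json-simple is an assumption based on the JSONParser/ParseException usage below.
// CustomSimpleAnalyzer, HtmlWithTikaParser and HtmlDTO are project classes (not shown).
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.exception.TikaException;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.xml.sax.SAXException;

@ContextConfiguration(locations={
"file:src/main/webapp/WEB-INF/spring/root-context.xml"})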
@RunWith(SpringJUnit4ClassRunner.class)
public class TestHtmlIndexer {
    private static final Logger logger = LoggerFactory.getLogger(TestHtmlIndexer.class);
    
    private Directory dir = null;
    
//    @Autowired
//    private WhitespaceAnalyzer whitespaceAnalyer;
    
    @Autowired
    private StandardAnalyzer standardAynalyzer;
    
    
    private CustomSimpleAnalyzer customAnalyzer;
    
    private WhitespaceAnalyzer whitespaceAnalyzer;
    
    private SimpleAnalyzer simpleAnalyzer;
    
    private IndexWriter writer;
    
    @Autowired
    private HtmlWithTikaParser htmlParser;
    
    @Autowired
    private TieredMergePolicy tmp;
    
    @Value("${fileindex}")
    private String path;
    
    
    @Before
    public void setup() throws IOException, InterruptedException{
        customAnalyzer = new CustomSimpleAnalyzer(Version.LUCENE_36);

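        //Several storage implementations are commonly used: NIOFSDirectory, SimpleFSDirectory, FSDirectory, RAMDirectory, etc.
        //RAMDirectory - keeps the index in memory; often used in tests.
        //NIOFSDirectory - best suited to Unix-like systems; as far as I know, a JRE bug makes it a poor choice on Windows.
        //FSDirectory - said to be the most widely used; FSDirectory.open() picks an implementation for the platform.
        //Open the directory.
        dir = FSDirectory.open(new File(path));
        //Configure how to index: the Lucene version and the analyzer to use.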
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
                customAnalyzer);
        
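        //Setting for how the index is opened.
        //OpenMode.CREATE - deletes the existing index and re-indexes from scratch every time.
        //OpenMode.CREATE_OR_APPEND - creates the index if none exists, otherwise appends to it.
        //OpenMode.APPEND - appends to an existing index (fails if there is none).
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        //Merge policy for index segments. Honestly, I read about it and still did not understand it, so I just use it.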
        iwc.setMergePolicy(tmp);
        
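        //Unlike an RDB, Lucene does not manage lock contention for you: only one IndexWriter can hold an index's write lock,
        //and opening another one while the lock is held simply fails with an error. So the lock must be checked first.
        lockChecker();
        //Finally, create the IndexWriter, passing the directory and the config.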
        writer = new IndexWriter(dir, iwc);
        
        
    }
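    public void lockChecker() throws IOException, InterruptedException {
        //While a writer holds the index, a lock file named IndexWriter.WRITE_LOCK_NAME ("write.lock") sits in the index directory.
        //Some lock factories leave that file on disk after the lock is released, so testing for the file alone can loop forever;
        //IndexWriter.isLocked(dir) checks the lock itself.
        while(IndexWriter.isLocked(dir)){
//            dir.clearLock(name);
            Thread.sleep(10);
        }
    }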
    
    public void addDocument(HtmlDTO dto){
        try {
            
            writer.addDocument(dto.convetDocument());
        } catch (CorruptIndexException e) {
            // TODO Auto-generated catch block
            logger.error(e.getMessage());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            logger.error(e.getMessage());
        }
    }
    
    
    
    public void writeClose() throws CorruptIndexException, IOException{
        if(writer != null){
            writer.close();
        }
    }
    
    
    @Test
    public void testAddDocument() throws CorruptIndexException, IOException, SAXException, TikaException{
        URL url = this.getClass().getClassLoader().getResource("html/xxx.json"); 
        String path = url.getPath();
        File file = new File(path);
        
        
        JSONParser parser = new JSONParser();
        try {
             
            Object obj = parser.parse(new FileReader(path));
     
            JSONObject jsonObject = (JSONObject) obj;
     
            
            Iterator<String> keys =  jsonObject.keySet().iterator();
            while(keys.hasNext()){
                String key = keys.next();
                JSONObject valueObj =  (JSONObject) jsonObject.get(key);
                String filepath = valueObj.get("FilePath").toString();
                String CATEGORY_TEXT_ID = valueObj.get("CATEGORY_TEXT_ID").toString();
                String breadcrumb = valueObj.get("Breadcrumb").toString();
                String CATEGORY_TREE = valueObj.get("CATEGORY_TREE").toString();
                String CATEGORY_ID = valueObj.get("CATEGORY_ID").toString();
                String LOCALE_KEY = valueObj.get("LOCALE_KEY").toString();
                String CATEGORY_TITLE = valueObj.get("CATEGORY_TITLE").toString();
                String CATEGORY_DESC = valueObj.get("CATEGORY_DESC").toString();
                
                
                HtmlDTO dto = new HtmlDTO();
                
                dto.setCategoryTextId(Integer.parseInt(CATEGORY_TEXT_ID));
                dto.setCategoryTree(CATEGORY_TREE);
                dto.setBreadcrumb(breadcrumb);
                dto.setCategoryId(Integer.parseInt(CATEGORY_ID));
                dto.setLocaleKey(LOCALE_KEY);
                dto.setCategoryTitle(CATEGORY_TITLE);
                dto.setCategoryDesc(CATEGORY_DESC);
                
                
                url = this.getClass().getClassLoader().getResource("html/"+filepath); // modify this part.
                ArrayList<String> list = htmlParser.htmlParser(url.getPath());
                
                dto.setText(list.get(0));
                dto.setHtml(list.get(1));
                
                addDocument(dto);
            }
            
     
        } catch (FileNotFoundException e) {
            logger.error(e.getMessage());
        } catch (IOException e) {
            logger.error(e.getMessage());
        } catch (ParseException e) {
            logger.error(e.getMessage());
        }
     
    }
    
    @After
    public void tearDown() throws CorruptIndexException, IOException{
        writeClose();
    }
    
}
 


The example above is a JUnit test that reads a JSON file and indexes the HTML files found at the paths it lists.
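One thing worth noting about lockChecker() in the test above: its loop has no upper bound, so a lock that is never released hangs the test. A minimal sketch of a bounded wait follows, in plain Java. The class LockWait and its parameters are hypothetical, and the lock check is passed in as a BooleanSupplier so the sketch assumes nothing about how the lock is detected.

```java
import java.util.function.BooleanSupplier;

// Bounded polling helper (hypothetical, not part of Lucene).
class LockWait {
    // Polls until isLocked reports false or timeoutMillis elapses.
    // Returns true if the lock was released in time, false on timeout or interrupt.
    static boolean waitUntilUnlocked(BooleanSupplier isLocked,
                                     long timeoutMillis,
                                     long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (isLocked.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of spinning forever
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(waitUntilUnlocked(() -> false, 100, 10)); // true
        System.out.println(waitUntilUnlocked(() -> true, 50, 10));   // false
    }
}
```

In the test, the supplier would wrap whatever check lockChecker() performs (that check throws IOException, so it would need to be caught inside the lambda), and setup() could fail fast when the helper returns false.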

convetDocument() (spelled as in the code) looks like this.

... (omitted) ...
public Document convetDocument() {
        // TODO Auto-generated method stub
        
        Document doc = new Document();

        NumericField categoryTextIndex = new NumericField("categoryTextId",Field.Store.YES,true);
        categoryTextIndex.setIntValue(this.getCategoryTextId());        
        doc.add(categoryTextIndex);
//        
        NumericField categoryId = new NumericField("categoryId",Field.Store.YES,true);
        categoryId.setIntValue(this.getCategoryId());        
        doc.add(categoryId);    
        
        
        doc.add(new Field("categoryTree",this.getCategoryTree(),Field.Store.YES,Field.Index.NOT_ANALYZED ));
        doc.add(new Field("localeKey",this.getLocaleKey(),Field.Store.YES,Field.Index.NOT_ANALYZED ));
        doc.add(new Field("breadcrumb",this.getBreadcrumb(),Field.Store.YES,Field.Index.NOT_ANALYZED ));
        doc.add(new Field("categoryTitle",this.getCategoryTitle(),Field.Store.YES,Field.Index.NOT_ANALYZED ));
        doc.add(new Field("categoryDesc",this.getCategoryDesc(),Field.Store.YES,Field.Index.NOT_ANALYZED ));
        doc.add(new Field("text", this.getText(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("html", this.getHtml(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
        
        return doc;
        
        
    }
... (omitted) ...

The Field constructors used most often are org.apache.lucene.document.Field.Field(String name, String value, Store store, Index index, TermVector termVector) and org.apache.lucene.document.Field.Field(String name, String value, Store store, Index index).

There are two Store options: Field.Store.YES means the value is stored, and Field.Store.NO means the value is not stored.

Here "value" simply means the Field's original plain text, not the indexed terms.

Field.Index has three main options: NOT_ANALYZED, NO, and ANALYZED (there are also NOT_ANALYZED_NO_NORMS and ANALYZED_NO_NORMS variants).

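NOT_ANALYZED means the field's value is not analyzed; it is indexed as a single term, so a search can compare against the plain text as a whole. Think of it as comparing against a specific column in an RDB.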

NO means the field cannot be searched. It is used when you only want to store a value.

ANALYZED runs the field's value through the analyzer and indexes the resulting terms.
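How the Store and Index options combine is easier to see in a toy model of a single document. The sketch below is plain Java; the class ToyFieldOptions is hypothetical, its enums only borrow the option names, and its whitespace/lowercase "analysis" is an assumption for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the Store / Index options for one document (not Lucene code).
class ToyFieldOptions {
    enum Store { YES, NO }
    enum Index { NO, NOT_ANALYZED, ANALYZED }

    // Original values kept for retrieval (Store.YES only).
    private final Map<String, String> stored = new HashMap<>();
    // Searchable terms per field name.
    private final Map<String, Set<String>> terms = new HashMap<>();

    void add(String name, String value, Store store, Index index) {
        if (store == Store.YES) stored.put(name, value);
        if (index == Index.NO) return; // not searchable at all
        Set<String> fieldTerms = terms.computeIfAbsent(name, k -> new HashSet<>());
        if (index == Index.NOT_ANALYZED) {
            fieldTerms.add(value); // the whole value becomes one term
        } else {
            // Stand-in for an analyzer: lowercase and split on whitespace.
            for (String t : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!t.isEmpty()) fieldTerms.add(t);
            }
        }
    }

    // Exact term lookup on one field.
    boolean matches(String name, String term) {
        Set<String> fieldTerms = terms.get(name);
        return (fieldTerms == null ? Collections.<String>emptySet() : fieldTerms).contains(term);
    }

    // Returns the stored original value, or null if it was not stored.
    String storedValue(String name) {
        return stored.get(name);
    }

    public static void main(String[] args) {
        ToyFieldOptions doc = new ToyFieldOptions();
        doc.add("localeKey", "ko_KR", Store.YES, Index.NOT_ANALYZED);
        doc.add("text", "Hello Lucene World", Store.NO, Index.ANALYZED);
        doc.add("path", "/tmp/a.html", Store.YES, Index.NO);
        System.out.println(doc.matches("localeKey", "ko_KR")); // true
        System.out.println(doc.matches("localeKey", "ko"));    // false
        System.out.println(doc.matches("text", "lucene"));     // true
        System.out.println(doc.storedValue("text"));           // null
        System.out.println(doc.matches("path", "/tmp/a.html")); // false
        System.out.println(doc.storedValue("path"));           // /tmp/a.html
    }
}
```

The two options are independent: a field can be searchable without being retrievable (the "text" field here) or retrievable without being searchable (the "path" field).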

TermVector is for when you want to look at specific terms in the index(?). To be honest, I use it for debugging or when extracting terms.

A TermVector stores things like each term's offsets and number of occurrences for a document.
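As a rough picture of what such per-document term information contains, here is a plain-Java sketch that computes frequency, token positions and character offsets for each term of one text. The class ToyTermVector is hypothetical and tokenizes on whitespace only; it does not reflect Lucene's storage format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy term vector for one field of one document (illustration only).
class ToyTermVector {
    static final class Entry {
        int freq;                                        // number of occurrences
        final List<Integer> positions = new ArrayList<>(); // token positions
        final List<int[]> offsets = new ArrayList<>();     // {start, end}, end exclusive
    }

    static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            Entry e = vector.computeIfAbsent(m.group(), k -> new Entry());
            e.freq++;
            e.positions.add(position++);
            e.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        Entry to = build("to be or not to be").get("to");
        System.out.println(to.freq);      // 2
        System.out.println(to.positions); // [0, 4]
        for (int[] o : to.offsets) {
            System.out.println(o[0] + "-" + o[1]); // 0-2, then 13-15
        }
    }
}
```

Offsets like these are what make it possible to point back at the exact span of the original text where a term occurred, which is handy when debugging what the analyzer produced.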

Starting with the next post, I will explain the search process in detail using example source code.

posted by 제스트
