12 | February | 2009 | I'm on your side, when times get rough.

2009-02-12

Detecting Multi-Language Codepage #3

Filed under: Programming — Peter_KIM @ 09:12

Last time we looked at an example written in C++. For some readers, that code may have been of little use.
This time, we will determine the code page of a text file using C# on the .NET Framework.

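The .NET Framework class library offers more convenient functions than the Win32 API does.
The one to note is Encoding.GetPreamble. It returns the BOM bytes that an encoding writes, so it can be used to check the BOM of a text file saved as Unicode. The MSDN documentation for Encoding.GetPreamble includes the following passage.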

The Unicode byte order mark (BOM) is serialized as follows (in hexadecimal):

UTF-8: EF BB BF

UTF-16 big endian byte order: FE FF

UTF-16 little endian byte order: FF FE

UTF-32 big endian byte order: 00 00 FE FF

UTF-32 little endian byte order: FF FE 00 00
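
The table above can be checked directly against what the framework returns. The following is a minimal sketch, not part of the original sample; the class name PreambleDump and the helper ToHex are illustrative names introduced here. It prints the preamble of each Unicode encoding as hexadecimal bytes.

```csharp
using System;
using System.Text;

static class PreambleDump
{
    // Formats a byte array as space-separated hex, e.g. "EF BB BF".
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,       // UTF-16 big endian
            Encoding.Unicode,                // UTF-16 little endian
            new UTF32Encoding(true, true),   // UTF-32 big endian, with BOM
            Encoding.UTF32                   // UTF-32 little endian
        };

        // One line per encoding: its registered name and its BOM bytes.
        foreach (Encoding enc in encodings)
            Console.WriteLine("{0,-10} {1}", enc.WebName, ToHex(enc.GetPreamble()));
    }
}
```

Encodings that do not define a BOM, such as Encoding.ASCII, return an empty array from GetPreamble, which is why a preamble comparison alone cannot identify them.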

If a text file is stored in ANSI characters and has no BOM, the only option is to fall back on the Win32 API functions used earlier. In the end, it comes down to the Win32 API.

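The program checks the code page of a text file in the following order.

1. Use the Win32 IsTextUnicode function to decide whether the content is Unicode (UTF-16 Little Endian). This can be useful when the text file has no BOM.
2. Compare the preamble bytes obtained from Encoding.GetPreamble with the first bytes read from the text file to determine the Unicode code page.
3. Use the DetectInputCodepage function (a method of MLang's IMultiLanguage2 interface) to determine the code page of the text file.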

As mentioned in the C++ walkthrough, even these three steps cannot identify the code page of every text file. They only cover most text files created on Windows systems (more precisely, files whose bytes the user has not tampered with).
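
One reason detection without a BOM can only be a best guess is that the same bytes are often valid in more than one encoding. The sketch below is an illustration added here, not taken from the original sample; the class name AmbiguousBytes and the method Decode are made up for the example. It decodes one two-byte sequence as UTF-8 and as ISO-8859-1.

```csharp
using System;
using System.Text;

static class AmbiguousBytes
{
    // Decodes the same bytes with the encoding registered under the given name.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    static void Main()
    {
        byte[] bytes = { 0xC3, 0xA9 };

        // As UTF-8 the two bytes form a single character, U+00E9.
        string asUtf8 = Decode(bytes, "utf-8");

        // As ISO-8859-1 every byte is its own character: U+00C3 then U+00A9.
        string asLatin1 = Decode(bytes, "iso-8859-1");

        Console.WriteLine("{0} char(s) as utf-8, {1} char(s) as iso-8859-1",
                          asUtf8.Length, asLatin1.Length);
    }
}
```

Both decodings succeed without error, so nothing in the bytes themselves says which one the author of the file intended. A detector has to rely on statistics, and it can be wrong.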

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using MultiLanguage;   // interop assembly generated from MLang.idl

namespace DetectML
{
    class CMLang
    {
        public enum MLDETECTCP
        {
            MLDETECTCP_NONE = 0,    // Default setting will be used.
            MLDETECTCP_7BIT = 1,    // Input stream consists of 7-bit data.
            MLDETECTCP_8BIT = 2,    // Input stream consists of 8-bit data.
            MLDETECTCP_DBCS = 4,    // Input stream consists of double-byte data.
            MLDETECTCP_HTML = 8,    // Input stream is an HTML page.
            MLDETECTCP_NUMBER = 16  // Not currently supported.
        }

        // Returns nonzero when the buffer is likely to contain Unicode text.
        [DllImport("advapi32", CallingConvention = CallingConvention.StdCall)]
        public static extern int IsTextUnicode(Byte[] lpBuffer, int cb, IntPtr lpi);

        // Returns the detected code page of the file, or -1 if none was found.
        public Int32 DetectTextFileEncoding(String sFilePath)
        {
            Int32 nScores = 10;
            IMultiLanguage2 iML2 = null;
            Int32 nCodepage = -1;

            // Read the whole file; an empty file gives nothing to inspect.
            Byte[] btInput = File.ReadAllBytes(sFilePath);
            Int32 nSrcSize = btInput.Length;
            if (nSrcSize == 0)
                return nCodepage;

            // Step 1: let IsTextUnicode decide whether this is UTF-16 Little Endian.
            if (IsTextUnicode(btInput, btInput.Length, IntPtr.Zero) != 0)
                return Encoding.Unicode.CodePage;

            // Step 2: compare the start of the file with the known Unicode preambles.
            nCodepage = DetectUnicodeEncoding(btInput);
            if (nCodepage != -1)
                return nCodepage;

            // Step 3: ask MLang for its best guess.
            try
            {
                iML2 = new CMultiLanguageClass();
                tagDetectEncodingInfo[] encInfo = new tagDetectEncodingInfo[nScores];

                // The interop signature takes the source bytes as SByte.
                SByte[] sbtInput = new SByte[nSrcSize];
                for (int i = 0; i < nSrcSize; ++i)
                    sbtInput[i] = unchecked((SByte)btInput[i]);

                iML2.DetectInputCodepage((uint)MLDETECTCP.MLDETECTCP_NONE, 0, ref sbtInput[0], ref nSrcSize, ref encInfo[0], ref nScores);

                // On return, nScores holds the number of candidates filled in.
                if (nScores > 0)
                    nCodepage = (Int32)encInfo[0].nCodePage;
            }
            finally
            {
                // Release the COM object even if detection throws.
                if (iML2 != null)
                    Marshal.FinalReleaseComObject(iML2);
            }

            return nCodepage;
        }

        // Returns the code page whose BOM starts the stream, or -1 if no BOM matches.
        private Int32 DetectUnicodeEncoding(Byte[] btStream)
        {
            // UTF-32 is tested before UTF-16 because the UTF-32 little endian BOM
            // (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).
            Encoding[] encodings = new Encoding[] {
                Encoding.UTF8,
                new UTF32Encoding(true, true),   // Unicode (UTF-32 Big-Endian)
                Encoding.UTF32,
                Encoding.BigEndianUnicode,
                Encoding.Unicode };

            foreach (Encoding enc in encodings)
            {
                Byte[] btPreamble = enc.GetPreamble();

                // A file shorter than the BOM cannot start with it.
                if (btStream.Length < btPreamble.Length)
                    continue;

                bool bMatch = true;
                for (int j = 0; j < btPreamble.Length; ++j)
                {
                    if (btPreamble[j] != btStream[j])
                    {
                        bMatch = false;
                        break;
                    }
                }

                if (bMatch)
                    return enc.CodePage;
            }

            return -1;
        }
    }
}

To compile this code, one more step is needed. You may have wondered what using MultiLanguage; in the code refers to. To call DetectInputCodepage, an interop assembly has to be generated with the MIDL and TLBIMP tools.
Explaining these tools is outside the scope of this post, so it is omitted here as well.

Running the following commands in order at the Visual Studio Command Prompt produces the file "MultiLanguage.dll".

MIDL MLang.Idl
tlbimp mlang.tlb /silent

Add a reference to the generated assembly in the project.
Finally, the code page value obtained can be used to create an encoding object and read the file, as follows.

Encoding oEnc = Encoding.GetEncoding(nCodePage);
String sFileContents = File.ReadAllText(sFilePath, oEnc);
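
The read-back step can be tried end to end without MLang. The following is a rough sketch under stated assumptions: CodePageFromBom is a hypothetical stand-in that only performs the BOM comparison (it is not the DetectTextFileEncoding method from this post), and RoundTrip and WriteAndReadBack are illustrative names. It writes a temporary file, finds the code page from the BOM, and reads the text back with Encoding.GetEncoding.

```csharp
using System;
using System.IO;
using System.Text;

static class RoundTrip
{
    // Hypothetical stand-in for the detection step: BOM comparison only.
    // Returns the matching code page, or -1 when no BOM is present.
    public static int CodePageFromBom(byte[] data)
    {
        // UTF-32 comes before UTF-16 because FF FE 00 00 starts with FF FE.
        Encoding[] candidates =
        {
            Encoding.UTF8,
            new UTF32Encoding(true, true),
            Encoding.UTF32,
            Encoding.BigEndianUnicode,
            Encoding.Unicode
        };

        foreach (Encoding enc in candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (data.Length < bom.Length)
                continue;

            bool match = true;
            for (int i = 0; i < bom.Length && match; ++i)
                match = data[i] == bom[i];

            if (match)
                return enc.CodePage;
        }
        return -1;
    }

    // Writes text to a temporary file, detects the code page, and reads it back.
    // Returns null when no BOM was found.
    public static string WriteAndReadBack(string text, Encoding writeEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, writeEncoding);   // emits the preamble
            int codePage = CodePageFromBom(File.ReadAllBytes(path));
            if (codePage == -1)
                return null;
            return File.ReadAllText(path, Encoding.GetEncoding(codePage));
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        string original = "코드페이지 codepage";
        string readBack = WriteAndReadBack(original, Encoding.BigEndianUnicode);
        Console.WriteLine(readBack == original);
    }
}
```

File.ReadAllText also honours a BOM on its own, so passing the encoding explicitly matters most for files without one, which is the case the MLang step is meant to handle.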

This concludes the three-part series on detecting the code page of a text file.
All source code used is available for download at the following address.
https://skydrive.live.com/embedicon.aspx/.Public/DetectMultiLanguage.7z?cid=1bbcdfedee1c617e
