Understanding Strings In COM
System Notes
To replicate the steps described in this article, you'll need Windows 95+ or Windows NT 4.0+ and Visual C++ 5.0 or higher.
ANSI and Unicode, char and wchar_t were not enough: COM introduced several new string data types, and the differences and the process of conversion are not always obvious to the uninitiated. This article clarifies the situation once and for all for the benefit of raw COM, ATL and MFC programmers.
Strings, i.e. vectors of alphanumeric characters, are and have always been a fundamental data type in every programming language and platform. Whereas the computer itself prefers to deal with numbers, human beings prefer messages of text to sequences of binary, hexadecimal or even decimal digits. This implies that whenever a piece of software needs to interact with the user (or signal some notable events) some kind of string treatment is likely to come into play.
Until a few years ago strings were just strings, that is, arrays of single-byte data types (char in C/C++) containing the ASCII code of the character at each element. The biggest problem was distinguishing zero-terminated strings (also known as ASCIIZ) from non-zero-terminated arrays. Then came Unicode, a new character set which extended the size of each character from 8 to 16 bits, thus allowing for 65,536 distinct characters in theory, enough to include Far Eastern symbols as well, such as the standard Kanji set. In C/C++ a brand new standard data type, wchar_t, was defined to store Unicode characters, and consequently the APIs of Unicode-aware Win32 operating systems that took strings as parameters had to be duplicated to provide both ANSI and Unicode versions.
Just as the Windows programmer community began to get acquainted with this duplication and got into the habit of not assuming anything about the length of a character a priori, COM jumped to the central stage with its burden of new types and aliases. If you are wondering what is the functional difference between an array of OLECHARs and a pointer to a BSTR, when and how it is necessary to convert a string to another type, and what degree of assistance ATL and MFC offer to the developer, this article is for you.
OLECHARs
The main string data type in COM is named OLECHAR, which is the kind of variable expected by almost all COM library functions and by the methods of well-behaved interfaces. An OLECHAR represents a single OLE-compatible character, therefore you can speak of a string only when you have an array of OLECHARs. It is obvious to everyone who has used C++ for some time that the language has no built-in OLECHAR data type, as underlined (among other things) by the upper case of the name. The C and C++ standard specifications dictate the existence of only two character types: char and wchar_t. Hence, OLECHAR must be an alias of one of them, and in fact it is. The relation is established by the standard Win32 header file wtypes.h, which we will meet again later in this article. The following code snippet, adapted from the header file for clarity, represents the official definition of OLECHAR in C/C++:
#if defined(_WIN32) && !defined(OLE2ANSI)
typedef WCHAR OLECHAR;
#else
typedef char OLECHAR;
#endif
The same file also defines the LPOLESTR and LPCOLESTR types:
#if defined(_WIN32) && !defined(OLE2ANSI)
typedef OLECHAR __RPC_FAR *LPOLESTR;
typedef const OLECHAR __RPC_FAR *LPCOLESTR;
#else
typedef LPSTR LPOLESTR;
typedef LPCSTR LPCOLESTR;
#endif
as aliases of OLECHAR* and const OLECHAR* in Win32, but aliases of LPSTR and LPCSTR in Windows 3.1x. The __RPC_FAR symbol can be ignored as it expands to nothing, so for all practical purposes LPOLESTR and OLECHAR* can be used interchangeably.
As you can see, the OLECHAR type does not map to the same built-in type on every platform. If the code is compiled for 32-bit Windows, which can be detected from the definition of the _WIN32 preprocessor symbol, all COM characters are Unicode characters (WCHAR is itself a typedef that translates to the built-in wchar_t type). If not, then the build is probably targeting Windows 3.1x, which does not support Unicode strings at all, so all the strings are regular old arrays of char. Note that on Sun Solaris, the main UNIX flavor to benefit from a port of the (D)COM implementation to date, OLECHARs are 16-bit Unicode characters exactly as on Win32.
The original Microsoft engineers who designed COM made a pretty courageous decision: they de facto imposed Unicode on everyone in the 32-bit world at a time when the original version of Windows NT was barely taking shape and the doubled amount of RAM required to hold the same strings could easily become problematic due to the high cost of memory. But the decision proved advantageous, as it saved COM developers from having to implement two variants of each interface (and of the coclasses implementing it) just to deal with every possible type of client.
Now we have seen how to define a COM-compliant character and by extension a COM-compliant string, but we have not revealed yet how one can initialize such a string with a string literal. The following statement:
const OLECHAR* pComStr;
pComStr = "I love VCDJ and COM";
does work in Windows 3.1x because only ANSI strings exist there, but will fail to compile on Win32 and Solaris because we are trying to copy an ANSI string to a Unicode array of characters. The following form:
const OLECHAR* pComStr;
pComStr = L"I love VCDJ and COM";
will give the exact opposite results: it works on Win32 and Solaris, but is incorrect on Windows 3.1x. What we really need is a way to define the type of a string irrespective of the platform. Nothing could fit the bill better than a macro, as in the code below:
const OLECHAR* pComStr;
pComStr = OLESTR("I love VCDJ and COM");
The OLESTR() macro is translated differently depending on the target of the build process, so we obtain the correct definition in all cases. Wtypes.h reports it as follows, with some secondary adjustments made to clarify the original code:
#if defined(_WIN32) && !defined(OLE2ANSI)
#define OLESTR(str) L##str
#else
#define OLESTR(str) str
#endif
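To make the mechanism concrete, the sketch below reproduces it with hypothetical names (MODEL_OLE_WIDE, ModelOleChar, MODEL_OLESTR and ModelOleStrLen are all made up for illustration) so that it compiles without the Windows headers. It is a model of the token-pasting trick, not the actual contents of wtypes.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch standing in for "_WIN32 defined and OLE2ANSI not defined".
#define MODEL_OLE_WIDE 1

#if MODEL_OLE_WIDE
typedef wchar_t ModelOleChar;        // wide build: one wchar_t per character
#define MODEL_OLESTR(str) L##str     // pastes the L prefix onto the literal
#else
typedef char ModelOleChar;           // narrow build: plain char
#define MODEL_OLESTR(str) str        // leaves the literal untouched
#endif

// The same source line compiles in both configurations.
static const ModelOleChar* const kGreeting = MODEL_OLESTR("I love COM");

// Counts characters (not bytes) up to the terminator, for either character type.
inline std::size_t ModelOleStrLen(const ModelOleChar* s)
{
    assert(s != NULL);
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

Setting MODEL_OLE_WIDE to 0 switches both the typedef and the macro together, which is why code written against the macro never needs to change.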
Note: In the rest of the Win32 API there is a discrepancy between the string treatment of Windows 95 / Windows 98 and that of Windows NT, since the former employ one-byte ANSI characters and the latter internally works only with two-byte Unicode characters. However, when it comes to COM, all of these operating systems agree on the use of Unicode strings.
At this point you may be curious as to why the data type was called OLECHAR rather than the more obvious COMCHAR. The answer to this question has its roots partly in history and partly in marketing: until a few years ago OLE2, the main family of technologies relying on the COM foundation, was deemed more important than COM itself, hence the acronym OLE spread everywhere. The later change of marketing orientation could not be reflected in the symbol names to avoid breaking a lot of existing and correctly functioning COM/OLE code. (See my Q&A column in VCDJ print and online for extensive info on this sometimes unclear transition of terms and intents.)
OLECHARs are the standard way to create strings in COM code and by far the most comfortable as long as C and C++ are used on both the client side and the server side. Other languages and tools bring their burden of special constraints that open the way to another kind of string, which is the topic of the next section.
Copyright © 1999 - Visual C++ Developers Journal
BSTRs
B-strings, more properly called Basic strings, are a special string format. Instead of consisting solely of a classic array of characters followed by a NUL character (code \0) that marks the end of the array, a BSTR has an in-memory layout that is a superset of an OLECHAR string. In short, a BSTR is a null-terminated array of OLECHARs prefixed by its length, and the BSTR pointer itself points at the first character rather than at the prefix. The string length is determined by the stored count, not by the index of the first null character, so a BSTR may legitimately contain embedded nulls.
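The layout can be sketched in portable C++ as follows. This is a model only: ModelChar, ModelBstr, ModelAlloc, ModelLen and ModelFree are hypothetical names, char16_t stands in for a 16-bit OLECHAR, and the four-byte prefix holding a byte count mirrors how the Win32 format is documented rather than reproducing the system allocator.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t ModelChar;     // 16-bit stand-in for a Win32 OLECHAR
typedef ModelChar* ModelBstr;   // points at the first character, not at the prefix

// Allocates [4-byte length in bytes][len characters][terminating zero].
ModelBstr ModelAlloc(const ModelChar* src, std::uint32_t len)
{
    assert(len <= 0x7FFFFFFFu / sizeof(ModelChar));   // keep the byte count in range
    const std::uint32_t byteLen =
        static_cast<std::uint32_t>(len * sizeof(ModelChar));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + byteLen + sizeof(ModelChar)));
    if (block == nullptr)
        return nullptr;
    std::memcpy(block, &byteLen, sizeof byteLen);      // write the length prefix
    ModelBstr str = reinterpret_cast<ModelBstr>(block + sizeof(std::uint32_t));
    if (src != nullptr)
        std::memcpy(str, src, byteLen);                // data may contain zeros
    else
        std::memset(str, 0, byteLen);
    str[len] = 0;                                      // terminator after the data
    return str;
}

// Reads the character count back from the prefix; never scans for a zero.
std::uint32_t ModelLen(ModelBstr str)
{
    if (str == nullptr)
        return 0;                                      // a null string counts as empty
    std::uint32_t byteLen;
    std::memcpy(&byteLen,
                reinterpret_cast<unsigned char*>(str) - sizeof byteLen,
                sizeof byteLen);
    return static_cast<std::uint32_t>(byteLen / sizeof(ModelChar));
}

// Releases the whole block, starting from the hidden prefix.
void ModelFree(ModelBstr str)
{
    if (str != nullptr)
        std::free(reinterpret_cast<unsigned char*>(str) - sizeof(std::uint32_t));
}
```

Because the pointer handed out skips the prefix, such a string can be passed to code that expects a plain null-terminated wide string, while length-aware code still sees any characters that follow an embedded zero.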
The presence of the length before the actual array data makes these strings suitable for manipulation in high-level tools like Visual Basic (for which this string format was invented in the first place) and Java on a COM-aware virtual machine like Microsoft's JVM. In fact, there is no other way to exchange string-like data with components written in those languages than to employ BSTRs. While in C and C++ the developer has to understand and use the data type in a rather uncomfortable manner, both Visual Basic and Java encapsulate it in their traditional string types, respectively String and java.lang.String. The end developer is therefore shielded from the subtleties of the organization of the raw bytes in memory. Moreover, the tools take care of allocating and freeing the memory required to hold the content without the programmer needing to know how this process works behind the scenes.
This is the bright side of the coin, of course. You as the hardcore C/C++ engineer get the tough part of the work, since you need to learn a completely new, specific set of APIs that carry out the basic operations on Basic strings. The family of functions is named the "system strings management API" and its members can easily be distinguished by the "Sys" prefix in their names.
The following code snippet, borrowed from Oleauto.h (this stuff used to be most useful when coupled with Automation, as Visual Basic's COM support was a lot less powerful then), shows the prototypes of each of the functions in the group:
/*---------------------------------------------------------------------*/
/* BSTR API */
/*---------------------------------------------------------------------*/
WINOLEAUTAPI_(BSTR) SysAllocString(const OLECHAR *);
WINOLEAUTAPI_(INT) SysReAllocString(BSTR *, const OLECHAR *);
WINOLEAUTAPI_(BSTR) SysAllocStringLen(const OLECHAR *, UINT);
WINOLEAUTAPI_(INT) SysReAllocStringLen(BSTR *, const OLECHAR *, UINT);
WINOLEAUTAPI_(void) SysFreeString(BSTR);
WINOLEAUTAPI_(UINT) SysStringLen(BSTR);
#ifdef _WIN32
WINOLEAUTAPI_(UINT) SysStringByteLen(BSTR bstr);
WINOLEAUTAPI_(BSTR) SysAllocStringByteLen(LPCSTR psz, UINT len);
#endif
Don't be unnerved by the probably unfamiliar WINOLEAUTAPI_() word preceding all the functions; it is simply a macro defined in the same header file that expands to a long list of modifiers necessary to adjust the calling convention, exportation details, and return type. You can blissfully ignore it for our purposes.
The following table briefly describes the task of each routine:
| Function name | Description |
| --- | --- |
| SysAllocString() | Allocates a new BSTR and initializes it with an OLECHAR* |
| SysReAllocString() | Reallocates an existing BSTR and initializes it with an OLECHAR* |
| SysAllocStringLen() | Allocates a new BSTR, copies a specified number of characters from the passed OLECHAR* into it, and then appends a null character |
| SysReAllocStringLen() | Reallocates an existing BSTR, copies a specified number of characters from the passed OLECHAR* into it, and then appends a null character |
| SysFreeString() | Deallocates a BSTR |
| SysStringLen() | Returns the number of characters in a BSTR |
| SysStringByteLen() | Returns the length in bytes of a BSTR (Win32 only) |
| SysAllocStringByteLen() | Allocates a BSTR that contains the ANSI string passed as a parameter. Does not perform any ANSI-to-Unicode translation (Win32 only) |
The succinct description provided above, in conjunction with the official documentation, should cover what you need to know to deal with BSTRs. Note that the expected usage pattern is the prior allocation of an array of OLECHARs, which is later copied into the system string.
Basic strings must be allocated and freed manually. But who has the responsibility of doing so when function calls are involved? This is a general COM question and so the answer does not apply solely to strings. If the parameter is input-only (IDL attribute [in]) the caller is responsible for both the creation and the destruction of the variable. If the parameter is output-only (IDL attribute [out]) then the callee is responsible for the allocation of the string, but the caller is expected to free it after use. If the parameter is both input and output (IDL attribute [in, out]) then the caller allocates the string and frees it after the method returns. The callee, however, is allowed to free the string and allocate a new one if necessary before returning it to the caller, in which case the caller frees whatever string it gets back.
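The three rules can be sketched with a small portable model. Everything here is hypothetical: ModelStr stands in for BSTR, ModelAllocStr and ModelFreeStr play the roles of SysAllocString() and SysFreeString(), and the Callee* functions stand in for interface methods. A counter of live allocations makes it easy to check that every string is freed exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

typedef char16_t* ModelStr;     // hypothetical stand-in for BSTR
static int g_liveStrings = 0;   // outstanding allocations, for leak checking

std::size_t ModelStrLen(const char16_t* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

ModelStr ModelAllocStr(const char16_t* text)    // role of SysAllocString()
{
    const std::size_t bytes = (ModelStrLen(text) + 1) * sizeof(char16_t);
    ModelStr s = static_cast<ModelStr>(std::malloc(bytes));
    if (s == nullptr)
        return nullptr;
    std::memcpy(s, text, bytes);
    ++g_liveStrings;
    return s;
}

void ModelFreeStr(ModelStr s)                   // role of SysFreeString()
{
    if (s != nullptr) {
        std::free(s);
        --g_liveStrings;
    }
}

// [in]: the callee only reads; it neither frees nor keeps the pointer.
std::size_t CalleeIn(const char16_t* name)
{
    return ModelStrLen(name);
}

// [out]: the callee allocates; ownership passes to the caller.
bool CalleeOut(ModelStr* result)
{
    assert(result != nullptr);
    *result = ModelAllocStr(u"from callee");
    return *result != nullptr;
}

// [in, out]: the callee may free the caller's string and hand back a new one.
bool CalleeInOut(ModelStr* value)
{
    assert(value != nullptr);
    ModelStr replaced = ModelAllocStr(u"replaced");
    if (replaced == nullptr)
        return false;                 // leave the caller's string intact
    ModelFreeStr(*value);
    *value = replaced;
    return true;
}

// Caller side: frees everything it owns once the calls have returned.
int CallerDemo()
{
    ModelStr in = ModelAllocStr(u"input");
    if (in != nullptr)
        CalleeIn(in);
    ModelFreeStr(in);                 // [in]: caller created, caller frees

    ModelStr out = nullptr;
    CalleeOut(&out);
    ModelFreeStr(out);                // [out]: callee created, caller frees

    ModelStr inout = ModelAllocStr(u"original");
    CalleeInOut(&inout);
    ModelFreeStr(inout);              // [in, out]: caller frees whatever comes back

    return g_liveStrings;             // 0 when nothing leaked
}
```

In real COM code the same pattern applies with BSTR parameters, with the additional point that an [out] pointer should be treated as uninitialized by the callee.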
Obviously these details interest C/C++ developers only, as Visual Basic will continue to treat strings as usual without any special consideration.
BSTR wrappers
Both ATL and MFC offer particular support for simplified BSTR management. ATL does it by means of a specialized wrapper class, CComBSTR, whose declaration in atlbase.h looks like the following (stripped down as usual for clarity and space constraints):
class CComBSTR
{
public:
BSTR m_str;
CComBSTR();
CComBSTR(int nSize, LPCOLESTR sz = NULL);
CComBSTR(LPCOLESTR pSrc);
CComBSTR(const CComBSTR& src);
CComBSTR& operator=(const CComBSTR& src);
CComBSTR& operator=(LPCOLESTR pSrc);
~CComBSTR();
unsigned int Length() const;
operator BSTR() const;
BSTR* operator&();
BSTR Copy() const;
void Attach(BSTR src);
BSTR Detach();
void Empty();
#if _MSC_VER>1020
bool operator!();
#else
BOOL operator!();
#endif
void Append(const CComBSTR& bstrSrc);
void Append(LPCOLESTR lpsz);
void AppendBSTR(BSTR p);
void Append(LPCOLESTR lpsz, int nLen);
CComBSTR& operator+=(const CComBSTR& bstrSrc);
#ifndef OLE2ANSI
CComBSTR(LPCSTR pSrc);
CComBSTR(int nSize, LPCSTR sz = NULL);
CComBSTR& operator=(LPCSTR pSrc);
void Append(LPCSTR);
#endif
HRESULT WriteToStream(IStream* pStream);
HRESULT ReadFromStream(IStream* pStream);
};
The utilization of the class is very straightforward even for the non-ATL experts. Basically the features offered are:
encapsulation of the allocation and deallocation procedures within the constructor and destructor;
duplication of the contents (through CComBSTR::Copy());
possibility to append almost any kind of string to the wrapped BSTR exploiting the overloading feature of C++;
a readable test for an empty (NULL) string through the overloaded ! operator;
basic I/O operations to store the contents of the string to, and retrieve it from, a structured storage stream.
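The ownership-related features in this list can be illustrated with a stripped-down holder class. This is a portable sketch of the idea, not ATL's implementation: ModelHolder, ModelStr and ModelDup are hypothetical names, and malloc()/free() stand in for the system string allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

typedef char16_t* ModelStr;   // hypothetical stand-in for BSTR

// Stand-in for SysAllocString(): returns a heap copy, or nullptr for a null source.
inline ModelStr ModelDup(const char16_t* text)
{
    if (text == nullptr)
        return nullptr;
    std::size_t n = 0;
    while (text[n] != 0)
        ++n;
    ModelStr s = static_cast<ModelStr>(std::malloc((n + 1) * sizeof(char16_t)));
    if (s != nullptr)
        std::memcpy(s, text, (n + 1) * sizeof(char16_t));
    return s;
}

class ModelHolder
{
public:
    ModelStr m_str;

    ModelHolder() : m_str(nullptr) {}
    explicit ModelHolder(const char16_t* src) : m_str(ModelDup(src)) {}
    ModelHolder(const ModelHolder& other) : m_str(ModelDup(other.m_str)) {}
    ModelHolder& operator=(const ModelHolder& other)
    {
        if (this != &other) {
            ModelStr copy = ModelDup(other.m_str);
            Empty();
            m_str = copy;
        }
        return *this;
    }
    ~ModelHolder() { Empty(); }                          // frees on scope exit

    bool operator!() const { return m_str == nullptr; }  // true when nothing is held
    ModelStr Copy() const { return ModelDup(m_str); }    // caller owns the result
    void Attach(ModelStr src)                            // takes ownership of src
    {
        if (src != m_str) {
            Empty();
            m_str = src;
        }
    }
    ModelStr Detach()                                    // gives ownership away
    {
        ModelStr s = m_str;
        m_str = nullptr;
        return s;
    }
    void Empty()
    {
        std::free(m_str);
        m_str = nullptr;
    }
};

// Moves a string from one holder to another without copying it.
bool HolderDemo()
{
    ModelHolder a(u"abc");
    ModelStr raw = a.Detach();        // a no longer owns the string
    bool ok = !a && raw != nullptr;
    ModelHolder b;
    b.Attach(raw);                    // b will free it in its destructor
    return ok && !!b && b.m_str[0] == u'a';
}
```

Detach() is the piece that matters when returning a string through an [out] parameter: the wrapper hands over the raw pointer and its destructor then has nothing left to free.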
On the other hand, MFC does not provide any direct wrapper class for system strings. All the support is an integral part of the extremely versatile CString class. As shown in the following code snippet borrowed from the class's declaration in afx.h, there are only a couple of methods that specifically generate COM strings:
// OLE BSTR support (use for OLE automation)
BSTR AllocSysString() const;
BSTR SetSysString(BSTR* pbstr) const;
Internally CString::AllocSysString() allocates a new BSTR using the APIs we examined in an earlier section and copies the CString's contents to the newly created system string, which is then returned to the caller. There is no such function as CString::FreeSysString(), so to deallocate the memory occupied by the returned BSTR, the global API ::SysFreeString() has to be called. CString::SetSysString() instead reallocates the BSTR pointed to by the parameter and copies the CString's contents into it. Both methods throw a CMemoryException in case of memory allocation problems.
Moreover, if you are using Visual C++ 5.0 or higher, you can exploit the Direct To COM proprietary extension which includes, among many other things, a _bstr_t class. The documentation reports that it is defined inside comdef.h, while in reality its declaration resides in comutil.h. The degree of encapsulation and functionality is similar to ATL's CComBSTR, but remember that using the COM compiler support binds you to Visual C++ even more than ATL would do. Probably the most relevant difference between the two implementations is that _bstr_t raises C++ exceptions and thus requires your code to be prepared to catch them, whereas CComBSTR does not. This detail will likely influence your choice more than all the other possible considerations. The following code listing summarizes the public interface of _bstr_t; the comments should make it easy to understand what the diverse method groups are up to:
class _bstr_t {
public:
// Constructors
//
_bstr_t() throw();
_bstr_t(const _bstr_t& s) throw();
_bstr_t(const char* s) throw(_com_error);
_bstr_t(const wchar_t* s) throw(_com_error);
_bstr_t(const _variant_t& var) throw(_com_error);
_bstr_t(BSTR bstr, bool fCopy) throw(_com_error);
// Destructor
//
~_bstr_t() throw();
// Assignment operators
//
_bstr_t& operator=(const _bstr_t& s) throw();
_bstr_t& operator=(const char* s) throw(_com_error);
_bstr_t& operator=(const wchar_t* s) throw(_com_error);
_bstr_t& operator=(const _variant_t& var) throw(_com_error);
// Operators
//
_bstr_t& operator+=(const _bstr_t& s) throw(_com_error);
_bstr_t operator+(const _bstr_t& s) const throw(_com_error);
// Friend operators
//
friend _bstr_t operator+(const char* s1, const _bstr_t& s2);
friend _bstr_t operator+(const wchar_t* s1, const _bstr_t& s2);
// Extractors
//
operator const wchar_t*() const throw();
operator wchar_t*() const throw();
operator const char*() const throw(_com_error);
operator char*() const throw(_com_error);
// Comparison operators
//
bool operator!() const throw();
bool operator==(const _bstr_t& str) const throw();
bool operator!=(const _bstr_t& str) const throw();
bool operator<(const _bstr_t& str) const throw();
bool operator>(const _bstr_t& str) const throw();
bool operator<=(const _bstr_t& str) const throw();
bool operator>=(const _bstr_t& str) const throw();
// Low-level helper functions
//
BSTR copy() const throw(_com_error);
unsigned int length() const throw();
private:
// [...private stuff omitted...]
};
Frameworks and conversions
Your picture of OLECHAR and BSTR and your understanding of the manner in which COM handles strings should be much clearer now, but we still have to cope with type conversions between these somewhat special data types and the more traditional TCHAR, WCHAR and char.
ATL and MFC both use the same group of macros to deal with string conversions. These macros' names follow a precise convention: the characters before the "2" indicate the original type of the variable to convert, and the characters after the "2" indicate the destination type after the conversion. The following table lists the valid symbols in a conversion macro name:
| Short name | Data type |
| --- | --- |
| A | LPSTR, char* |
| OLE | LPOLESTR |
| T | LPTSTR, TCHAR* |
| W | LPWSTR, wchar_t* |
| BSTR | BSTR |
| C | const, associated with another type |
The macros operate intelligently: if for some reason the source and destination types coincide, the code does not waste time in a useless conversion. Internally most of the macros call the _alloca() run-time library function and allocate the storage for the new data on the stack, as this simplifies the deallocation policy by delegating it to the ordinary scope rules: the converted string is valid only until the enclosing function returns. The macros also rely on a few local variables, so a USES_CONVERSION macro must be placed before the first conversion in each function or class method that uses them. The following sample code, taken from the downloadable sample pack available on the Web, will clarify the process:
// Conversions through MFC/ATL's macros
void Sample2()
{
USES_CONVERSION; // declares the locals that the conversion macros rely on
// ANSI
LPCSTR ansiStr = "This is a sample message";
printf("BEFORE the string contains: %s\n", ansiStr);
// ANSI -> const TCHAR
const TCHAR* pTChar = A2CT(ansiStr);
_tprintf(_T("MIDWAY the string contains: %s\n"), pTChar);
// const TCHAR -> const Unicode (T2CW accepts a const source, unlike T2W)
LPCWSTR wStr = T2CW(pTChar);
wprintf(L"AFTER the string contains: %s\n", wStr);
}
As I stated earlier, the conversion macros are part of MFC and ATL, but surprisingly the header files are not directly shared by the two frameworks. ATL programmers should include Atlconv.h, while MFC developers are supposed to include Afxconv.h in their projects. After digging into the sources I found that in the latest versions of MFC, Afxconv.h does little more than include Atlconv.h itself, so in practice the string conversion code exposed by the two frameworks is the same.
The COM compiler support offers good BSTR conversion code, too. The actual conversion functions are the cast operators that convert a _bstr_t to either an ANSI or a Unicode string, either constant or not, plus the omnipresent class constructors. The following code snippet, taken from the downloadable sample pack available on the Web, shows some common usage patterns of BSTR conversions:
// BSTR conversions
void Sample3()
{
USES_CONVERSION;
LPWSTR wStr = L"This is a sample message";
wprintf(L"BEFORE the string contains: %s\n", wStr);
BSTR bstr1 = W2BSTR(wStr);
CString mfcStr = bstr1;
printf("MIDWAY the string contains: %s\n", (LPCSTR)mfcStr);
BSTR bstr2 = mfcStr.AllocSysString();
_bstr_t bstr3 = bstr2;
VERIFY(bstr3 == (_bstr_t)bstr1);
WCHAR* wStr2 = bstr3;
wprintf(L"AFTER the string contains: %s\n", wStr2);
::SysFreeString(bstr1);
::SysFreeString(bstr2);
}
This is music to the ears of those who work with more or less advanced frameworks, but what about those who prefer (or are compelled) to stick to low-level C++ COM development? They can still use the standard library conversion functions, such as mbstowcs() and wcstombs(), or the Win32 MultiByteToWideChar() and WideCharToMultiByte() APIs that the framework macros themselves ultimately rely on. Unfortunately such functions are not aware of BSTRs and OLECHARs, so heavy use of conditional compilation would be required to deal with the various possible combinations, and this is evil from a readability perspective. Power developers who program COM at the raw C++ level will probably work out, at a certain point in their learning curve, how to write a set of conversion macros by themselves. If you are lazy and prefer to use a precooked set of macros, you can freely use Don Box's YACL (the acronym stands for "Yet Another COM Library"), which is extremely efficient and does much more than just string conversion in pure C++ COM. The URL for the download is http://www.develop.com/dbox/yacl.htm.
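For readers who go the standard library route, a minimal sketch of the two conversions might look like the following. Widen() and Narrow() are hypothetical helper names; the conversions follow the current C locale, and the result is a plain wide or narrow string, so producing a BSTR would still require a separate allocation step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Converts a multibyte string (current C locale) to wide characters.
// Returns an empty string if the input contains an invalid sequence.
std::wstring Widen(const char* src)
{
    assert(src != NULL);
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(std::strlen(src) + 1);
    const std::size_t n = std::mbstowcs(&buf[0], src, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(&buf[0], n);
}

// Converts a wide string to a multibyte string (current C locale).
// Returns an empty string if a character cannot be represented.
std::string Narrow(const wchar_t* src)
{
    assert(src != NULL);
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(std::wcslen(src) * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(&buf[0], src, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(&buf[0], n);
}
```

Returning std::wstring and std::string keeps the buffers owned by the result, which avoids the lifetime restriction of the stack-based framework macros at the cost of a heap allocation per conversion.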
Conclusion
After some study and direct experimentation, the various string types in COM prove to be much less cryptic and problematic than at first they seemed. Fundamentally, it all boils down to recognizing and memorizing a handful of new data types which may behave differently on different platforms, and getting used to the framework of handy conversion functions provided by ATL, MFC, or the Direct To COM Visual C++ extension. Regardless of which of the mentioned tasks you are going to tackle, I hope this article will serve as a valuable aid in saving precious time working with strings.