String Literals, Character Encodings, and Multiplatform C++

Daniel Martín

Table of contents

  • An Introduction to Character Encoding
  • String Literals in C++
  • Clang
  • MSVC
  • Our Recommendations for Simple Clang and MSVC Interoperability
  • Conclusion

Today’s C++ software needs to support text in multiple languages. If you want to target multiple platforms and compilers, understanding the little details about how string encoding works is crucial to delivering software that is correct, and to writing multilingual tests that are reliable when you run them on multiple platforms.

In this blog post, I’ll take a look at how two of the most popular C++ compilers, Clang and MSVC, encode the bytes in source code files into strings in a running application. This knowledge may seem obscure, but you need it to understand and reason about the character-encoding issues you may encounter when string literals contain text that isn’t in English.

An Introduction to Character Encoding

Strings in C++ programs are ubiquitous, but computers don’t understand the concept of strings. Instead, they represent strings as bytes — that is, just ones and zeroes. Enter the idea of a character encoding (formally, a character-encoding scheme), which is a method that maps between raw bytes and the string they represent. For example, the string “Hello” may have the byte representation “01001000 01100101 01101100 01101100 01101111” in a particular character encoding. There are several character encodings, and some of the most popular ones are:

  • UTF-8

  • UTF-16

  • Latin-1

  • JIS

This article only requires basic familiarity with the popular character encodings listed above. If you want additional details about how they actually encode data, see, for example, UTF-8 Everywhere.
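To make the “Hello” example concrete, here is a minimal sketch that prints the bits of each byte in a string. The helper name to_bits is hypothetical, and the sketch assumes the compiler stores unprefixed literals in an ASCII-compatible encoding, which is the case for both UTF-8 and Windows-1252.

```cpp
#include <bitset>
#include <string>

// Render each byte of `text` as eight binary digits, separated by spaces.
// Hypothetical helper, written only to illustrate the byte view of a string.
std::string to_bits(const std::string& text) {
    std::string out;
    for (unsigned char byte : text) {
        if (!out.empty()) out += ' ';
        out += std::bitset<8>(byte).to_string();
    }
    return out;
}
```

Calling to_bits("Hello") yields the same five groups of bits shown in the paragraph above.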

String Literals in C++

String literals are a programming language feature for representing string values directly in the source code of a program. C++ supports several types of string literals:

  • "hello" — represents the string “hello”

  • L"hello" — represents the wide string “hello”

  • u8"hello" — represents the string “hello”, encoded in UTF-8

  • u"hello" — represents the string “hello”, encoded in UTF-16

  • U"hello" — represents the string “hello”, encoded in UTF-32
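Each prefix selects a different element type for the literal’s underlying array. The following sketch, with hypothetical helper names unit_size and unit_count, reports how wide one code unit is and how many code units a literal holds. Note that u8 literals use char up to C++17 and char8_t from C++20; both are one byte wide.

```cpp
#include <cstddef>

// Size in bytes of one code unit of a string literal.
template <typename Char, std::size_t N>
constexpr std::size_t unit_size(const Char (&)[N]) { return sizeof(Char); }

// Number of code units in a literal, excluding the terminating null.
template <typename Char, std::size_t N>
constexpr std::size_t unit_count(const Char (&)[N]) { return N - 1; }
```

On mainstream platforms, unit_size("hello") and unit_size(u8"hello") are 1, unit_size(u"hello") is 2, and unit_size(U"hello") is 4, whereas unit_size(L"hello") depends on the platform.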

The next sections explain how two of the most popular C++ compilers convert the bytes in a source file, which may contain string literals, into strings.

Clang

Clang is a popular compiler written as part of the LLVM project. Clang simplifies character encoding handling by only supporting UTF-8. This means that if you have a C++ source file encoded in UTF-16 and want to compile it with Clang, the compiler will emit the following error:

fatal error: UTF-16 (BE) byte order mark detected in 'sample.cpp', but encoding is not supported
1 error generated.

As Clang always assumes source files are encoded in UTF-8, string literal handling simply involves converting between UTF-8 strings and what’s known as the “execution encoding,” which can be UTF-8, UTF-16, or UTF-32. Which execution encoding is used depends on the character width of the string literal, which in turn depends on both how the string literal is declared and the target platform of the compilation. For example:

  • "hello" — the char width is one byte

  • L"hello" — the char width is four bytes on Unix systems and two bytes on Windows

  • u"hello" — the char width is usually two bytes

  • U"hello" — the char width is usually four bytes

In general, there are only three possible character widths: one byte, two bytes, or four bytes. If the character width is one byte, Clang simply validates that the string literal is valid UTF-8. If the character width is two bytes, Clang converts the string literal from UTF-8 to UTF-16. If the character width is four bytes, Clang converts the string literal from UTF-8 to UTF-32.
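The effect of these conversions is visible in the code units the compiler emits. The sketch below uses a hypothetical helper, code_units, to copy the code units of a literal into a vector. The literals are written with \u and \U escapes so that the example doesn’t depend on how the source file itself is encoded.

```cpp
#include <cstddef>
#include <vector>

// Copy the code units of a literal (without the terminating null) into a
// vector of integers so they can be inspected or printed.
// Hypothetical helper for illustration only.
template <typename Char, std::size_t N>
std::vector<unsigned long> code_units(const Char (&lit)[N]) {
    std::vector<unsigned long> units;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        units.push_back(static_cast<unsigned long>(lit[i]));
    }
    return units;
}
```

For example, code_units(u"\U0001F600") contains the UTF-16 surrogate pair 0xD83D, 0xDE00, while code_units(U"\U0001F600") contains the single UTF-32 unit 0x1F600. The same character in a wide literal, L"\U0001F600", should take one unit where wchar_t is four bytes and two units where it is two bytes.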

MSVC

MSVC is the Microsoft compiler included in Visual Studio. The internal handling of string literals and character encodings on Windows is much more complex, because, unlike Clang, MSVC doesn’t assume that every source file is encoded in UTF-8. Unfortunately, we don’t have access to the source code of MSVC to check what it does with literal strings, but the general process is described in this blog post. The rest of this section will distill the key details from that blog post.

From the Source File Encoding to UTF-8

Recent versions of MSVC encode strings internally as UTF-8. MSVC follows this process:

  • If the source file has a byte order mark (BOM), MSVC uses it to identify the encoding: a UTF-16 file is converted to UTF-8, and a file saved as UTF-8 with a BOM is left as UTF-8.

  • If the source file doesn’t have a BOM, then it tries to detect if the source file is encoded in UTF-16 by looking at the first eight bytes of the file.

  • If the source file doesn’t have a BOM and isn’t encoded in UTF-16, which is the typical case if the file was created on a Unix-like system, then MSVC decodes the source file using the system’s code page, and then it encodes the result in UTF-8.

From UTF-8 to the Execution Character Set

As described before in the section on string literals in C++, different string literals in C++ may have different character set representations. MSVC needs to convert between the encoding it uses internally, UTF-8, and the desired character set of the string literal.

This conversion again depends on the system’s code page when the string literal doesn’t have a prefix:

  • "hello" will be encoded in the system’s code page (generally Windows-1252 on English systems)

  • L"hello" will be encoded in UTF-16

  • u"hello" will be encoded in UTF-16

  • U"hello" will be encoded in UTF-32
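To see why the code page matters for unprefixed literals, compare how the same character is stored under two encodings. In Windows-1252, “é” (U+00E9) is the single byte 0xE9 and “€” (U+20AC) is the single byte 0x80. In UTF-8 they take two and three bytes respectively. The sketch below is a small hand-written UTF-8 encoder, with the hypothetical name encode_utf8, that shows the UTF-8 side of that comparison. It assumes valid input and performs no validation.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. Assumes `cp` is a valid scalar
// value (at most U+10FFFF and not a surrogate); no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here, encode_utf8(0x00E9) is the two bytes C3 A9 and encode_utf8(0x20AC) is the three bytes E2 82 AC, so code that measures or compares the bytes of an unprefixed literal can behave differently depending on which encoding the compiler chose.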

Visual Studio 2015 Introduced Two Important Flags

As you can see, character encoding is a complex topic in MSVC, as there’s a dependency on the system’s code page and the source file encoding. To try to make things simpler, Microsoft introduced two compiler flags in Visual Studio 2015:

  • /source-charset — This specifies the character encoding that will be used when decoding a source file.

  • /execution-charset — This specifies the character encoding that will be used when encoding data into the execution encoding.

For example, using /source-charset:utf-8 decodes the source file as UTF-8, irrespective of the encoding of the source file or the system’s code page. This configuration makes MSVC work like Clang, in that it assumes the source file is encoded in UTF-8. Most source files nowadays are stored in UTF-8, anyway.

Using /execution-charset:utf-8, for example, lets us avoid the problem of depending on the system’s code page when C++ string literals are declared without a prefix. UTF-8 will always be used.

If you simply want both /source-charset:utf-8 and /execution-charset:utf-8, then you can pass the convenient /utf-8 compiler flag.
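One way to notice a build that is missing these flags is a compile-time check on an unprefixed literal. The sketch below is an assumption about how such a guard could look, not something MSVC or Clang provides. The function name narrow_literals_are_utf8 is hypothetical.

```cpp
// True when unprefixed string literals are encoded as UTF-8.
// U+00E9 is the two bytes C3 A9 in UTF-8 but the single byte E9 in
// Windows-1252, so the array size and contents tell the encodings apart.
constexpr bool narrow_literals_are_utf8() {
    return sizeof("\u00E9") == 3 &&
           static_cast<unsigned char>("\u00E9"[0]) == 0xC3 &&
           static_cast<unsigned char>("\u00E9"[1]) == 0xA9;
}
```

Placing static_assert(narrow_literals_are_utf8(), "narrow literals are not UTF-8"); in a shared header should then stop the build when the execution character set isn’t UTF-8.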

Our Recommendations for Simple Clang and MSVC Interoperability

You can see that the variability in how compilers interpret multibyte strings can be very confusing. We recommend you consider these rules to ensure your multilingual test files are interpreted correctly on both Unix-like and Windows systems:

  • You can store source files that contain non-English strings in UTF-8 with a BOM. One of the problems with this solution is that storing files in UTF-8 with a BOM isn’t common, and it requires that you explicitly include the BOM when you save the file in your text editor or IDE. Another problem is that it’s possible to remove the BOM by mistake if you process source files using regular expressions. This kind of mistake is easy to miss in a code review because code review tools don’t show things like the BOM in a clear way. We consider saving files in UTF-8 with a BOM a possible, but fragile, approach.

  • Alternatively, you can pass the /utf-8 compiler flag to MSVC. This flag simplifies things because MSVC will assume UTF-8 encoding, just like Clang. The only source of variability will be that the size of a wide char on Unix-like systems is four bytes, but on Windows systems it’s two bytes. If you use wide strings in your source files, you might need to take this into account, depending on your use case.

  • Write string literals explicitly encoded as UTF-8 using the u8 prefix and \uXXXX escape sequences. For example, u8"\u0048\u0065\u006C\u006C\u006F" would reliably encode the string “Hello” as UTF-8. The tradeoff of this solution is that it’s less readable. However, it may be the best option when the character would otherwise be invisible. For example, u8"\uFEFF" represents the Unicode character “zero-width no-break space,” which is invisible in source files.

  • Avoid wide strings if possible. As mentioned in the second recommendation, the size of a wide character is different depending on the platform. Using wchar_t and the corresponding string type std::wstring will introduce complexity if your code or test code relies implicitly or explicitly on the size of a wide character and needs to run on Unix-like and Windows systems. In some cases, using a wide character string is inevitable, like when interfacing with some Windows APIs.
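As an illustration of the u8 recommendation, the sketch below turns a u8 literal into a std::string of raw bytes. The helper name as_bytes is hypothetical. It is written as a template because the element type of u8 literals is char up to C++17 and char8_t from C++20, and the sketch is meant to compile in both.

```cpp
#include <cstddef>
#include <string>

// Copy the bytes of a one-byte-per-unit literal (u8 or unprefixed) into a
// std::string, dropping the terminating null.
template <typename Char, std::size_t N>
std::string as_bytes(const Char (&lit)[N]) {
    static_assert(sizeof(Char) == 1, "expects a u8 or unprefixed literal");
    return std::string(reinterpret_cast<const char*>(lit), N - 1);
}
```

With this helper, as_bytes(u8"\uFEFF") holds the three bytes EF BB BF on both Clang and MSVC, regardless of the source file encoding or the system’s code page.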

Conclusion

This blog post explored the little details of character encoding in source files, C++ string literals, and compilers. We hope to have improved your understanding of these tricky concepts. We think that writing good multilingual software often requires a solid grasp of things like character encoding and how popular compilers work.

Author
Daniel Martín, Core Engineer

Daniel is part of the Core Team at Nutrient and has worked on multiple topics, ranging from cryptography and text systems, to file format support and JavaScript engines. Outside of work, he likes spending time with his family, football, reading books, and watching films.

Copyright 2025 Nutrient. All rights reserved.
