Removing characters in a specific Unicode range from a string

IT小君 2021-10-21T06:36:07

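I have a program that parses tweets from the Twitter streaming API in real time. Before storing them, I encode them as UTF-8. Some characters end up in the string as ?, ?? or ??? instead of their respective Unicode code points, and this causes problems. After further investigation, I found that the problematic characters come from the "Emoticons" block (U+1F600 - U+1F64F) and the "Miscellaneous Symbols and Pictographs" block (U+1F300 - U+1F5FF). I tried to remove them, but without success: the matcher ends up replacing almost every character in the string, not just the Unicode range I want.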

String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]",
        Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");

What can I do to remove these characters?

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);

        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Strings; // assumption: Strings is Guava's helper class

class EmojiEraser {
    // Each alternative is a character class; the surrogate pairs denote supplementary code points.
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string, potentially containing emojis (arrives as a Unicode string)
     * @return the input string with emojis removed
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}

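Add the negation operator ^ to the regex pattern. To match everything outside the ASCII range, you can use the expression [^\\x00-\\x7F], which should give you the desired result.

The result looks like this: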

Before: #Hello twitter  How are you?
After: #Hello twitter   How are you?

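Edit

To explain further: you can also express the range with \u escapes, as in [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to extend the range to allow additional characters, consult a list of Unicode characters to find the upper bound you need.

For example, if you want to keep accented vowels (as used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]: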

Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter   How are you? á é í ó ú

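First of all, the Unicode block in question is defined in Java (which follows the standard strictly) as Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regular expression, \p{So} (the general category "Symbol, other") is a broader class that includes most of these pictographs along with other symbols: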

s = s.replaceAll("\\p{So}+", "");
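As a narrower alternative, Java's regex engine can also match by Unicode block using the \p{In...} syntax. The following is only a sketch, assuming Java 7 or later (both blocks were added in Unicode 6.0); the class name EmojiBlockFilter and the method stripEmojiBlocks are made up for illustration.

```java
import java.util.regex.Pattern;

class EmojiBlockFilter {
    // Runs of code points from the two blocks named in the question.
    // \p{InName} selects a Unicode block; names are resolved by Character.UnicodeBlock.forName.
    private static final Pattern EMOJI_BLOCKS = Pattern.compile(
            "[\\p{InEmoticons}\\p{InMiscellaneous_Symbols_And_Pictographs}]+");

    // Hypothetical helper: removes only characters that belong to those two blocks.
    static String stripEmojiBlocks(String s) {
        return EMOJI_BLOCKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String grinning = new String(Character.toChars(0x1F600)); // Emoticons block
        String cyclone = new String(Character.toChars(0x1F300));  // Miscellaneous Symbols and Pictographs
        String input = "Hi " + grinning + cyclone + " \u00A9 there";

        // Block-based filter: the copyright sign (category So, but outside both blocks) survives.
        System.out.println(stripEmojiBlocks(input));
        // Category-based filter from the answer: the copyright sign is removed as well.
        System.out.println(input.replaceAll("\\p{So}+", ""));
    }
}
```

The trade-off is coverage: the block filter leaves symbols outside the two blocks untouched, while \p{So} also strips characters such as the copyright sign.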

I tried this approach (the EmojiEraser class). The Unicode ranges are taken from the emoji ranges.

Assuming status.getText() returns a java.lang.String...

byte[] utf8Bytes = status.getText().getBytes("UTF-8");
utf8tweet = new String(utf8Bytes, "UTF-8");

The transcoding operation above produces the same result as:

utf8tweet = status.getText();

Java strings are implicitly UTF-16. UTF-16 and UTF-8 encode the same character set (Unicode), so converting from one to the other and back yields the original data.
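The round trip can be checked with a small sketch. The class name RoundTripDemo and the sample text are made up for illustration, and the claim is limited to well-formed strings (a string containing an unpaired surrogate would not survive the encoding step unchanged).

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes again, as the question's code does.
    static String roundTrip(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample text with an accented letter and one supplementary character (U+1F600).
        String original = "caf\u00E9 " + new String(Character.toChars(0x1F600));

        System.out.println(original.equals(roundTrip(original))); // prints "true"

        // Same text, different sizes depending on how it is counted.
        System.out.println("UTF-16 chars: " + original.length());
        System.out.println("code points:  " + original.codePointCount(0, original.length()));
        System.out.println("UTF-8 bytes:  " + original.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException.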

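Java regular expressions support supplementary ranges through surrogate pairs. You can match them as described in the answers to a related question.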
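To illustrate, here is a sketch that contrasts the pattern from the question with a range written in whole code points. It assumes Java 7 or later, where the \x{h...h} escape is available; the class and method names are made up for illustration. A \u escape in a regex consumes exactly four hex digits, so the question's pattern is read as the set U+1F30, the range '0' to U+1F64, and the letter 'f', which covers most ordinary text.

```java
import java.util.regex.Pattern;

class CodePointRangeDemo {
    // One contiguous range covering both blocks from the question: U+1F300 to U+1F64F.
    private static final Pattern EMOJI_RANGE = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // The pattern from the question, kept as written to show how it is misread.
    private static final Pattern QUESTION_PATTERN = Pattern.compile("[\\u1f300-\\u1f64f]");

    // Hypothetical helper: replaces each emoji code point in the range with one space.
    static String replaceEmoji(String s) {
        return EMOJI_RANGE.matcher(s).replaceAll(" ");
    }

    // Hypothetical helper: applies the question's pattern for comparison.
    static String replaceWithQuestionPattern(String s) {
        return QUESTION_PATTERN.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600));
        String tweet = "#Hello twitter " + emoji + " How are you?";

        // Only the emoji is replaced.
        System.out.println(replaceEmoji(tweet));
        // Letters, digits and '?' are replaced, while the emoji itself is left in place.
        System.out.println(replaceWithQuestionPattern(tweet));
    }
}
```

The surrogate-pair spelling mentioned above is an alternative way to write the same range in Java source when \x{h...h} is not available.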

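As eee pointed out in his comment, you most likely have a font problem. Whether a grapheme can be displayed usually depends on the fonts available on the user's system, the font selected, and the forms of font substitution supported by the rendering technology.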
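One way to tell a display problem from actual data loss is to print the code points of the string instead of the characters themselves. This is only a diagnostic sketch, assuming Java 8 or later for String.codePoints(); the class name CodePointDump and the method describe are made up for illustration.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Renders each code point as U+XXXX so the result does not depend on fonts or console encoding.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String text = "A" + new String(Character.toChars(0x1F600));
        System.out.println(describe(text)); // prints "U+0041 U+1F600"
    }
}
```

If the dump shows the expected code points (for example U+1F600) but the screen shows ?, the string is intact and the issue lies in how it is displayed. If the dump shows U+003F, the character was already replaced somewhere earlier in the pipeline.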