Add support for English #41
Open
ZeekYin wants to merge 7 commits into WorksApplications:main from ZeekYin:main
7 commits
15c9a69 params changed for English (ZeekYin)
ddc2454 add support for ascii char (ZeekYin)
2d2a167 Update LangEstimation.scala (ZeekYin)
c6058f8 1 (ZeekYin)
99887b0 english detectable (ZeekYin)
572d0a6 start judge from 50%~ (ZeekYin)
c1f9214 Changed estimation method (ZeekYin)
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="JAVA_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$/src/main/java" isTestSource="false" />
    </content>
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="JAVA_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$/src/main/scala" isTestSource="false" />
    </content>
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
lib/src/test/scala/com/worksap/nlp/uzushio/lib/lang/LangEstimationSpec.scala
60 changes: 52 additions & 8 deletions
@@ -1,15 +1,59 @@
package com.worksap.nlp.uzushio.lib.lang

import com.worksap.nlp.uzushio.lib.utils.ClasspathAccess
import java.nio.charset.{Charset, StandardCharsets}
import org.scalatest.freespec.AnyFreeSpec

class LangEstimationSpec extends AnyFreeSpec with ClasspathAccess {
class LangEstimationSpec extends AnyFreeSpec {

  "LangEstimation" - {
    val sniffer = new LangTagSniffer()
    "sniffs charset shift_jis fragment" in {
      val data = classpathBytes("lang/shift_jis.txt")
      val tags = sniffer.sniffTags(data, 0, data.length)
      assert("Shift-JIS" == tags.charset)
    val estimator = new LangEstimation()

    "detects Japanese language from a simulated Wikipedia page about Japan" in {
      // Simulate an HTML Wikipedia page introducing Japan, written in Japanese and encoded as Shift-JIS
      val htmlContent = """
        <html>
        <head>
        <title>日本 - Wikipedia</title>
        </head>
        <body>
        <h1>日本</h1>
        <p>日本(にっぽん、にほん)は、東アジアに位置する島国で、太平洋に面しています。日本は北海道、本州、四国、九州の四つの主要な島から構成されています。</p>
        <p>日本の首都は東京で、人口は世界でも有数の規模を誇ります。日本は高度に発展した国であり、技術、経済、文化など多くの分野で世界に影響を与えています。</p>
        <p>日本の歴史は古く、何世紀にもわたる様々な変革と発展を遂げてきました。現代の日本は、明治維新後に急速に産業化され、世界的な経済大国となりました。</p>
        <p>第二次世界大戦後、日本は驚異的な復興を遂げ、現在では世界で最も強力な経済の一つとして知られています。</p>
        </body>
        </html>
      """
      val data = htmlContent.getBytes("Shift_JIS")
      val result = estimator.estimateLang(data, 0, Charset.forName("Shift_JIS"))

      // Assert that the detected language is Japanese
      assert(result.isInstanceOf[ProbableLanguage])
      assert(result.asInstanceOf[ProbableLanguage].lang == "ja") // expected result: Japanese
    }

    "detects English language from a simulated Wikipedia page about Japan" in {
      // Simulate the English Wikipedia page about Japan, encoded as UTF-8
      val htmlContent = """
        <html>
        <head>
        <title>Japan - Wikipedia</title>
        </head>
        <body>
        <h1>Japan</h1>
        <p>Japan is an island country in East Asia, located in the northwest Pacific Ocean. It borders the Sea of Japan to the west, and extends from the Sea of Okhotsk in the north to the East China Sea and Taiwan in the south.</p>
        <p>Japan is a highly developed country, known for its advanced technology, strong economy, and rich culture. With a population of over 125 million, Japan is the world's eleventh most populous country, and Tokyo, its capital, is one of the most populous cities in the world.</p>
        <p>The country's history dates back to the 14th century BC, and over the centuries, it has evolved through various dynasties and periods. Modern Japan emerged in the late 19th century during the Meiji Restoration, which transformed it into an industrial and economic power.</p>
        <p>After World War II, Japan experienced rapid recovery and became one of the world's leading economies. Today, Japan is known for its influence in global technology, culture, and economy.</p>
        </body>
        </html>
      """
      val data = htmlContent.getBytes(StandardCharsets.UTF_8)
      val result = estimator.estimateLang(data, 0, StandardCharsets.UTF_8)

      // Assert that the detected language is English
      assert(result.isInstanceOf[ProbableLanguage])
      assert(result.asInstanceOf[ProbableLanguage].lang == "en") // expected result: English
    }
    }
  }
}
It is possible to avoid creating this string completely; you do not need it.
Pattern.matcher accepts any CharSequence, and CharBuffer implements the CharSequence interface, so a CharBuffer can be passed in directly: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#matcher(java.lang.CharSequence)
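To illustrate the review comment, here is a minimal sketch of matching a regex against a decoded CharBuffer without converting it to a String first. The object name `CharBufferMatching`, the helper names, and the ASCII-letter pattern are assumptions made for illustration; they are not taken from the pull request or from the uzushio code base.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

// Hypothetical helper, not part of uzushio: shows regex matching on a CharBuffer.
object CharBufferMatching {
  // Illustrative pattern (an assumption): runs of ASCII letters.
  private val asciiWord: Pattern = Pattern.compile("[A-Za-z]+")

  /** Counts matches of `pattern` in `chars` without calling `chars.toString`. */
  def countMatches(chars: CharBuffer, pattern: Pattern): Int = {
    // CharBuffer implements CharSequence, so it can be passed to matcher directly.
    val matcher = pattern.matcher(chars)
    var count = 0
    while (matcher.find()) count += 1
    count
  }

  /** Decodes UTF-8 bytes to a CharBuffer and counts ASCII-letter runs in it. */
  def countAsciiWords(bytes: Array[Byte]): Int = {
    val chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes))
    countMatches(chars, asciiWord)
  }
}
```

The matcher reads the buffer through `charAt` and `length`, which are relative to the buffer's current position and remaining length, so the buffer should be positioned at the start of the text to be scanned before it is handed to `matcher`.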