kspoon is a Kotlin Multiplatform library for parsing HTML into Kotlin objects. It uses ksoup as an HTML parser and kotlinx.serialization to create objects. This library is a successor to jspoon.
A big shoutout to @itboy87 for porting Jsoup to KMP - this library wouldn't exist without his amazing work. Check out the Ksoup repository!
Apply the serialization plugin in your module's build.gradle.kts or build.gradle file:
plugins {
    kotlin("plugin.serialization") version "<kotlin version>"
}
Add the following dependency to your module's build.gradle.kts or build.gradle file:
dependencies {
    implementation("dev.burnoo.kspoon:kspoon:0.1.2")
}
This library uses the lightweight variant of Ksoup. If you plan to use other variants of the Ksoup library within the same project, you may need to replace :ksoup with your preferred variant by using Gradle's dependency substitution.
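As a rough sketch, a substitution rule in the module's build.gradle.kts could look like the following. The module coordinates (com.fleeksoft.ksoup:ksoup for the lightweight artifact, ksoup-kotlinx as the replacement) and the version placeholder are assumptions made for illustration; check them against the Ksoup repository before relying on them.

```kotlin
// build.gradle.kts (module level)
// Assumption: kspoon pulls in "com.fleeksoft.ksoup:ksoup" and the project
// wants "com.fleeksoft.ksoup:ksoup-kotlinx" instead. Adjust both as needed.
configurations.all {
    resolutionStrategy.dependencySubstitution {
        // Every request for the lightweight module is redirected to the chosen variant.
        substitute(module("com.fleeksoft.ksoup:ksoup"))
            .using(module("com.fleeksoft.ksoup:ksoup-kotlinx:<ksoup version>"))
    }
}
```

The rule applies to every configuration of the module, so the transitive dependency brought in by kspoon and any direct Ksoup dependency resolve to the same artifact.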
kspoon works with any serializable class. Adding @Selector annotations to its serializable fields enables HTML parsing:
@Serializable
data class Page(
    @Selector("#header") val header: String,
    @Selector("li.class1") val intList: List<Int>,
    @Selector(value = "#image1", attr = "src") val imageSource: String,
)
You can then use a Kspoon instance to create objects:
val htmlContent = """<div>
    <p id='header'>Title</p>
    <ul>
        <li class='class1'>1</li>
        <li>2</li>
        <li class='class1'>3</li>
    </ul>
    <img id='image1' src='image.bmp' />
</div>""".trimIndent()
val page = Kspoon.parse<Page>(htmlContent)
println(page) // Page(header=Title, intList=[1, 3], imageSource=image.bmp)
For a single-value field, the library takes the first element in the HTML that matches the CSS selector and assigns its value to that field. For a List field such as intList above, every matching element is collected.
kspoon can be configured using the Kspoon {} factory function, which returns an instance that can be used for parsing. All available options with default values are listed below:
val kspoon = Kspoon {
    // Specifies the parsing function. Type: (String) -> Document
    parse = { html: String -> Ksoup.parse(html, baseUri = "") }
    // Default text mode used for parsing.
    defaultTextMode = HtmlTextMode.Text
    // Enables coercing values when the selected HTML element is not found.
    coerceInputValues = false
    // Module with contextual and polymorphic serializers to be used.
    serializersModule = EmptySerializersModule()
}
val page = kspoon.parse<Page>(HTML_CONTENT) // the target type must be given explicitly or inferred
By default, the HTML's textContent value is used to extract data. This behavior can be changed either in the configuration or by using the textMode parameter in the @Selector annotation. Options include InnerHtml, OuterHtml, or Data (for scripts and styles):
@Serializable
data class Page(
    @Selector("p", textMode = SelectorHtmlTextMode.OuterHtml)
    val content: String
)
val htmlContent = "<p><span>Text</span></p>"
val page = Kspoon.parse<Page>(htmlContent)
println(page) // Page(content=<p><span>Text</span></p>)
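The same switch can be made once for a whole Kspoon instance through the configuration instead of per field. The sketch below assumes that HtmlTextMode has an InnerHtml entry alongside the Text entry shown in the configuration listing, and that the import paths are as written; Snippet and parseInnerHtml are hypothetical names used only for this example.

```kotlin
// Assumed import paths; adjust them if the library organizes its packages differently.
import dev.burnoo.kspoon.HtmlTextMode
import dev.burnoo.kspoon.Kspoon
import dev.burnoo.kspoon.annotation.Selector
import kotlinx.serialization.Serializable

@Serializable
data class Snippet(
    // No textMode here, so the instance-wide default applies.
    @Selector("p") val content: String,
)

// Hypothetical helper: parses with inner HTML as the default text mode.
fun parseInnerHtml(html: String): String {
    val kspoon = Kspoon { defaultTextMode = HtmlTextMode.InnerHtml }
    return kspoon.parse<Snippet>(html).content
}

fun main() {
    // The result is expected to contain the nested <span> markup rather than plain text.
    println(parseInnerHtml("<p><span>Text</span></p>"))
}
```

A field that needs a different mode can still override the default with its own textMode parameter.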
It is also possible to get an attribute value by setting the attr parameter in the @Selector annotation (see Usage for an example).
A regular expression can be applied by passing the regex parameter to the @Selector annotation. After the text is extracted (using the HTML text mode or an attribute), the regex is applied to the resulting string. The returned string is the first matched group, or the entire match if no group is specified.
@Serializable
data class Page(
    @Selector(value = "#numbers", regex = "([0-9]+) ")
    val starNumber: Int // <span id="numbers">31 stars</span> (31 will be parsed)
)
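The group-selection rule can be illustrated with the Kotlin standard library alone. In the sketch below, applySelectorRegex is a hypothetical helper that mimics the rule as described, not kspoon's actual implementation; returning null when nothing matches is a choice made for this example only.

```kotlin
// Hypothetical helper mirroring the rule described above: return the first
// capture group if the pattern defines one, otherwise the entire match.
// Returns null when the pattern does not match at all (example-only choice).
fun applySelectorRegex(text: String, pattern: String): String? {
    val match = Regex(pattern).find(text) ?: return null
    // groupValues[0] is the whole match; groupValues[1] is the first group, if any.
    return match.groupValues.getOrNull(1) ?: match.value
}

fun main() {
    println(applySelectorRegex("31 stars", "([0-9]+) "))  // 31
    println(applySelectorRegex("31 stars", "[0-9]+ "))    // entire match "31 ", trailing space included
    println(applySelectorRegex("no digits", "([0-9]+) ")) // null
}
```

Wrapping only the digits in a group is what lets starNumber be decoded as an Int: the whole match would carry the trailing space.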
There are three ways to set default values:
- @Selector("#tag", defValue = "default") - if the HTML element is not found, the defValue will be used as the parsed string
- Nullable field - if the HTML element is not found, the value will be set to null
- coerceInputValues = true in the Kspoon {} configuration - enables coercing to a default value
@Serializable
data class Model(
    @Selector("span") val text: String = "not found"
)
val body = "<p></p>"
val text = Kspoon { coerceInputValues = true }.parse<Model>(body).text
println(text) // prints "not found"
defValue offers the best performance due to the internal logic of kotlinx.serialization. Nullable fields perform HTML selection twice. Coercing input values also performs HTML selection twice and additionally disables sequential decoding.
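The first two options can be combined in one class. The sketch below is only an illustration: Profile and parseProfile are hypothetical names, and the import paths are assumptions. Going by the rules above, nickname should fall back to its defValue and bio to null, because neither selector matches anything in the sample HTML.

```kotlin
// Assumed import paths; adjust them if the library organizes its packages differently.
import dev.burnoo.kspoon.Kspoon
import dev.burnoo.kspoon.annotation.Selector
import kotlinx.serialization.Serializable

@Serializable
data class Profile(
    // Option 1: defValue is parsed when no element matches "#nickname".
    @Selector("#nickname", defValue = "anonymous") val nickname: String,
    // Option 2: a nullable field becomes null when no element matches "#bio".
    @Selector("#bio") val bio: String?,
)

// Hypothetical helper using the default Kspoon instance.
fun parseProfile(html: String): Profile = Kspoon.parse<Profile>(html)

fun main() {
    // The sample HTML contains neither #nickname nor #bio.
    println(parseProfile("<p id='name'>Ada</p>"))
}
```

Since defValue is the cheapest option, it is a reasonable first choice wherever a string fallback is acceptable.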
Any KSerializer can be applied to a field annotated with @Selector to customize serialization logic. For example, date serializers from kotlinx-datetime:
@Serializable
data class Model(
    @Serializable(LocalDateIso8601Serializer::class)
    @Selector("span")
    val date: LocalDate,
)
Additionally, kspoon has built-in serializers for Ksoup classes: ElementSerializer, ElementsSerializer, and DocumentSerializer. They can be used directly or via contextual serialization:
@Serializable
data class Model(
    @Serializable(ElementSerializer::class) // or @Contextual
    @Selector("div.class1")
    val element: Element,
)
It is also possible to write custom kspoon serializers that can access the selected Element. Read more here.
The Kspoon class has a toFormat(): StringFormat function that can be used with third-party libraries. For detailed integration instructions, see the library-specific integration guides.
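Independently of any particular integration, the returned StringFormat can be used through the standard kotlinx.serialization API. The sketch below is a minimal illustration: Headline and decodeHeadline are hypothetical names, and the kspoon import paths are assumptions.

```kotlin
// Assumed kspoon import paths; the kotlinx.serialization imports are standard.
import dev.burnoo.kspoon.Kspoon
import dev.burnoo.kspoon.annotation.Selector
import kotlinx.serialization.Serializable
import kotlinx.serialization.StringFormat
import kotlinx.serialization.decodeFromString

@Serializable
data class Headline(
    @Selector("h1") val title: String,
)

// Hypothetical helper: decodes HTML through the generic StringFormat interface,
// which is the type that converter-style integrations usually accept.
fun decodeHeadline(html: String): Headline {
    val format: StringFormat = Kspoon { }.toFormat()
    return format.decodeFromString<Headline>(html)
}

fun main() {
    println(decodeHeadline("<h1>Hello</h1>").title)
}
```

Code that only knows about StringFormat can then decode HTML without depending on kspoon types directly.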
Supported targets: jvm, js, wasmjs, linuxX64, linuxArm64, tvosArm64, tvosX64, tvosSimulatorArm64, macosX64, macosArm64, iosArm64, iosSimulatorArm64, iosX64, mingwX64
See GitHub releases.