The invention relates, in an embodiment, to a computer-implemented method
for automatic charset detection, which includes detecting an encoding
scheme of a target document. The method includes training, using a
plurality of text document samples, to obtain a set of machine learning
models. Training includes using an SVM (Support Vector Machine) technique
to generate the set of machine learning models from feature vectors
obtained from the plurality of text document samples. The method also
includes applying the set of machine learning models against a set of
target document feature vectors converted from the target document to
detect the encoding scheme.
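
The train-then-apply flow described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the embodiment: the byte-frequency feature vectors, the one-vs-rest arrangement of linear SVMs trained by hinge-loss gradient descent, the three candidate encodings, the sample sentences, and the hypothetical helper names (to_feature_vector, train_linear_svm, train_models, detect_encoding).

```python
import math
from collections import Counter

# Assumed candidate encodings; the source does not name any.
ENCODINGS = ["utf-8", "utf-16-le", "koi8-r"]


def to_feature_vector(data: bytes) -> dict:
    """Sparse, L2-normalised byte-frequency vector plus a constant bias term."""
    counts = Counter(data)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    vec = {byte: c / norm for byte, c in counts.items()}
    vec["bias"] = 1.0
    return vec


def dot(w: dict, x: dict) -> float:
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def train_linear_svm(vectors, labels, epochs=200, lr=0.5, lam=1e-3):
    """Linear SVM fitted by sub-gradient descent on the L2-regularised hinge loss."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            violated = y * dot(w, x) < 1.0
            for k in w:  # L2 weight decay
                w[k] *= 1.0 - lr * lam
            if violated:  # hinge-loss step only when the margin is violated
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + lr * y * v
    return w


def train_models(samples):
    """samples: list of (raw_bytes, encoding_name). Returns one model per encoding."""
    vectors = [to_feature_vector(raw) for raw, _ in samples]
    models = {}
    for enc in ENCODINGS:
        labels = [1.0 if name == enc else -1.0 for _, name in samples]
        models[enc] = train_linear_svm(vectors, labels)
    return models


def detect_encoding(models, target: bytes) -> str:
    """Apply every model to the target's feature vector; the highest score wins."""
    x = to_feature_vector(target)
    return max(models, key=lambda enc: dot(models[enc], x))


# Illustrative training samples: the same sentences encoded in each scheme.
TRAIN_TEXTS = [
    "Быстрая коричневая лиса прыгает через ленивую собаку.",
    "Сегодня хорошая погода, и мы идем гулять в парк.",
    "Машинное обучение помогает определять кодировку текста.",
    "Этот документ содержит несколько простых предложений.",
]
samples = [(text.encode(enc), enc) for text in TRAIN_TEXTS for enc in ENCODINGS]
models = train_models(samples)

# Target document whose encoding is treated as unknown.
target = "Зимой в этом городе часто идет снег.".encode("koi8-r")
print(detect_encoding(models, target))
```

A practical detector would likely train on far more samples and on richer features such as byte n-grams. The sketch only shows how feature vectors from the samples feed model generation, and how the resulting set of models is then applied to the target document's feature vector.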