HeroTypes
In Texthero, we always work with Pandas Series and Pandas DataFrames to gain insights from text data. To make things easier and more intuitive, we distinguish between different types of Series/DataFrames, depending on where we are on the road to understanding our dataset.
Overview
When working with text data, it is easy to get overwhelmed by the many different functions that can be applied to the data. We want to make the whole journey as clear as possible. For example, when we start working with a new dataset, we usually want to do some preprocessing first. At the beginning, the data is in a DataFrame or Series where every document is one string. It might look like this:
                                    text
document_id                             
0            "Text in the first document"
1            "Text in the second document"
2            "Text in the third document"
3            "Text in the fourth document"
4                                    ...
Consequently, in Texthero's preprocessing module, the functions usually take as input a Series where every cell is a string, and return as output a Series where every cell is a string. We will call this kind of Series TextSeries, so users know immediately what kind of Series the functions can work on. For example, you might see a function
remove_punctuation(s: TextSeries) -> TextSeries
in the documentation. You then know that this function takes as input a TextSeries and returns as output a TextSeries, so it can be used in the preprocessing phase of your work, where each document is one string.
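To make that signature concrete, here is a minimal sketch of a function with the same shape, written with plain Pandas string methods. The name strip_punctuation is hypothetical, and this is not how Texthero implements remove_punctuation; it only illustrates the "TextSeries in, TextSeries out" contract.

```python
import pandas as pd


def strip_punctuation(s: pd.Series) -> pd.Series:
    """Hypothetical TextSeries -> TextSeries function (not Texthero's code)."""
    # Replace every character that is neither a word character nor
    # whitespace with a space.
    without_punct = s.str.replace(r"[^\w\s]", " ", regex=True)
    # Collapse the runs of whitespace left behind and trim the ends.
    return without_punct.str.replace(r"\s+", " ", regex=True).str.strip()


s = pd.Series(["Text: of first! document", "Text of second ... document"])
cleaned = strip_punctuation(s)
print(cleaned)
```

Because the output is again a Series of strings, it can be passed straight into any other function that expects a TextSeries.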
The HeroSeries Types
These are the three types currently supported by the library; almost all of the library's functions take one of these types as input and return one of them as output:
- TextSeries: Every cell is a text, i.e. a string. For example, pd.Series(["test", "test"]) is a valid TextSeries.
- TokenSeries: Every cell is a list of words/tokens, i.e. a list of strings. For example, pd.Series([["test"], ["token2", "token3"]]) is a valid TokenSeries.
- VectorSeries: Every cell is a vector representing text, i.e. a list of floats. For example, pd.Series([[1.0, 2.0], [3.0, 4.0]]) is a valid VectorSeries.
Additionally, some Texthero functions (most of those that accept a VectorSeries as input) also accept a Pandas DataFrame that represents a matrix. Every cell value is then one entry in the matrix. An example is
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["word1", "word2", "word3"]).
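The definitions above can be written down as simple checks. The following is only a sketch using plain Pandas and Python; the helper names is_text_series, is_token_series, and is_vector_series are hypothetical and are not part of Texthero's API.

```python
import pandas as pd


def is_text_series(s: pd.Series) -> bool:
    # Every cell must be a string.
    return all(isinstance(cell, str) for cell in s)


def is_token_series(s: pd.Series) -> bool:
    # Every cell must be a list whose items are all strings.
    return all(
        isinstance(cell, list) and all(isinstance(tok, str) for tok in cell)
        for cell in s
    )


def is_vector_series(s: pd.Series) -> bool:
    # Every cell must be a list whose items are all floats.
    return all(
        isinstance(cell, list) and all(isinstance(x, float) for x in cell)
        for cell in s
    )


text = pd.Series(["test", "test"])
tokens = pd.Series([["test"], ["token2", "token3"]])
vectors = pd.Series([[1.0, 2.0], [3.0, 4.0]])

print(is_text_series(text), is_token_series(tokens), is_vector_series(vectors))
```

Note that these simplified checks treat a Series of empty lists as both a TokenSeries and a VectorSeries, since there are no items to inspect.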
Now, if you see a function in the documentation that looks like this:
tfidf(s: TokenSeries) -> DataFrame
then you know that the function takes a Pandas Series whose cells are lists of strings (tokens) and will return a Pandas DataFrame representing a matrix (in this case a Document-Term Matrix). You might call it like this:
>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Text of first document", "Text of second document"])
>>> df_tfidf = s.pipe(hero.tokenize).pipe(hero.tfidf)
>>> df_tfidf
   Text  document     first   of    second
0   1.0       1.0  1.405465  1.0  0.000000
1   1.0       1.0  0.000000  1.0  1.405465
And this function:
pca(s: Union[VectorSeries, DataFrame]) -> VectorSeries
needs a DataFrame or VectorSeries as input and always returns a VectorSeries.
The Types in Detail
We'll now have a closer look at each of the types and learn where they are used in a typical NLP workflow.
TextSeries
In a TextSeries, every cell is a string. As we saw at the beginning of this tutorial, this type is mostly used in preprocessing. It is very simple and allows us to easily clean a text dataset. Additionally, many NLP functions such as named_entities, noun_chunks, and pos_tag take a TextSeries as input.
Example of a function that takes and returns a TextSeries:
>>> s = pd.Series(["Text: of first! document", "Text of second ... document"])
>>> hero.clean(s)
0     text first document
1    text second document
dtype: object
TokenSeries
In a TokenSeries, every cell is a list of words/tokens. We use this type to prepare our data for representation, i.e. for turning it into numbers so that we can gain insights from it through mathematical methods. This is why the functions that initially transform your documents to vectors, namely tfidf, term_frequency, and count, take a TokenSeries as input.
Example of a function that takes a TextSeries and returns a TokenSeries:
>>> s = pd.Series(["text first document", "text second document"])
>>> hero.tokenize(s)
0     [text, first, document]
1    [text, second, document]
dtype: object
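To illustrate why a TokenSeries is a convenient starting point for representation, the sketch below counts tokens per document and arranges the counts as a small matrix with one row per document and one column per term. It uses only Pandas and the standard library; it is not Texthero's implementation of count, and the variable names are chosen for illustration.

```python
from collections import Counter

import pandas as pd

tokens = pd.Series([["text", "first", "document"], ["text", "second", "document"]])

# Count the tokens of each document, then let Pandas align the counts
# into columns. Terms missing from a document become NaN, so fill with 0.
counts = (
    pd.DataFrame([dict(Counter(doc)) for doc in tokens], index=tokens.index)
    .fillna(0)
    .astype(int)
    .sort_index(axis=1)  # alphabetical column order for readability
)
print(counts)
```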
VectorSeries
In a VectorSeries, every cell is a vector representing text. We use this when we have a low-dimensional (e.g. vectors with length <= 1000), dense (so not a lot of zeroes) representation of our texts that we want to work on. For example, the dimensionality reduction functions pca, nmf, and tsne all take a high-dimensional representation of our text, in the form of a DataFrame (see below) or a VectorSeries, and return a low-dimensional representation of our text in the form of a VectorSeries.
Example of a function that takes as input a DataFrame or VectorSeries and returns a VectorSeries:
>>> s = pd.Series(["text first document", "text second document"]).pipe(hero.tokenize).pipe(hero.term_frequency)
>>> hero.pca(s)
0     [0.118, 0.0]
1    [-0.118, 0.0]
dtype: object
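A VectorSeries and a matrix-like DataFrame hold the same kind of information in two layouts, which is why functions can accept either. As a rough sketch in plain Pandas (not a Texthero function), this is how one could convert between the two by hand:

```python
import pandas as pd

vectors = pd.Series([[1.0, 2.0], [3.0, 4.0]])

# VectorSeries -> DataFrame: each vector becomes one row of the matrix.
matrix = pd.DataFrame(vectors.tolist(), index=vectors.index)

# DataFrame -> VectorSeries: each row becomes one list of floats.
back = pd.Series(matrix.to_numpy().tolist(), index=matrix.index)

print(matrix)
print(back)
```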
DataFrame
In Natural Language Processing, we are often working with matrices that contain information about our dataset. For example, the output of the functions tfidf, count, and term_frequency is a Document-Term Matrix, i.e. a matrix where each row is one document and each column is one term / word.
We use a Pandas DataFrame for this for two reasons:
- It looks nice.
- It can be sparse.
The second reason is worth explaining in more detail: in a big Document-Term Matrix, for example, we might have 10,000 different terms, so 10,000 columns in our DataFrame. Additionally, most documents will only contain a small subset of all the terms. Thus, in each row, there will be lots of zeros in our matrix. This is why we use a sparse matrix: A sparse matrix only stores the non-zero fields. And Pandas DataFrames support sparse data, so Texthero users fully profit from the sparseness!
This is a massive advantage when dealing with big datasets: a sparse DataFrame stores only the non-zero entries, which saves a lot of memory and computation time.
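The effect is easy to see on a small synthetic matrix. The sketch below uses only NumPy and Pandas; the matrix size and the placement of the non-zero entries are arbitrary assumptions made for illustration, and no Texthero function is involved.

```python
import numpy as np
import pandas as pd

# A 1000 x 200 matrix with exactly one non-zero entry per row.
dense_values = np.zeros((1000, 200), dtype="int64")
dense_values[np.arange(1000), np.arange(1000) % 200] = 1

dense = pd.DataFrame(dense_values)
# Store the same data sparsely, treating 0 as the value that is not stored.
sparse = dense.astype(pd.SparseDtype("int64", 0))

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
density = sparse.sparse.density  # fraction of entries that are stored

print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
print(f"density: {density}")
```

Here 1,000 of the 200,000 entries are non-zero, so the density is 0.005 and the sparse DataFrame needs far less memory than the dense one.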
Let's look at an example with some more data.
>>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
>>> data_count = data["text"].pipe(hero.tokenize).pipe(hero.count)
>>> data_count
     !   "  "'  ",  #  $  %  ...  £62m  £6m  £70m  £7m  £7million  £80,000  £8m
0    0   5   0   0  0  0  0  ...     0    0     0    0          0        0    0
1    0   0   0   0  0  0  0  ...     0    0     0    0          0        0    0
2    0  14   0   0  0  0  0  ...     0    0     0    0          0        0    0
3    0  10   0   0  0  0  0  ...     0    0     0    0          0        0    0
4    0   4   0   0  0  0  0  ...     0    0     0    0          0        0    0
..  ..  ..  ..  .. .. .. ..  ...   ...  ...   ...  ...        ...      ...  ...
732  0   2   0   0  0  0  2  ...     0    0     0    0          0        0    0
733  0   6   0   0  0  0  0  ...     0    0     0    0          0        0    0
734  0   5   0   0  0  0  0  ...     0    0     0    0          0        0    0
735  0  14   0   0  0  0  0  ...     0    0     0    0          0        0    0
736  0   6   0   0  0  0  0  ...     0    0     0    0          0        0    0
>>> data_count.sparse.density
0.010792808715706939
We can see that only around 1% of our DataFrame data_count is filled with non-zero values, so using the sparse DataFrame is saving us a lot of space.