Patterns in American Indian Stories
This heading is actually slightly misleading. The original goal of the project was to look for a pattern across the texts provided in our sample, but I decided against that. After looking at each source, I chose to work with American Indian Stories by Zitkala-Ša on its own.
Introduction
Regardless of which text I selected, I knew the project would involve Project Gutenberg. I looked up the terms and conditions for using its texts in case there was anything I needed to know. It seems that as long as I am not mass-downloading books with a bot, selling the texts, or using them commercially, I should be in the clear to use them.
Before starting:
Based on one of our previous assignments, I had an idea of what the data cleaning would look like. Here, “data cleaning” mostly means deleting everything that comes before the text begins or after it ends. I want to compile all of the semantically meaningful words of the text(s) I use. The blocks at the start and end of each file mostly just explain what Project Gutenberg is, so including them would fill my sample with irrelevant text and skew the data. If I compare texts against each other, I also need to keep in mind that two of the three sources are translated from another language. If I look at a text in isolation, I don’t think that is as important to note.
I could have compared all three texts, but I asked myself: “Do I want to use data just because I have it?” One or two texts should work fine unless I have a specific reason to use all three. I shouldn’t use all three just for the sake of it.
Which texts to choose?
I gathered a bit of background information on all the texts to make sure my choice of what to compare made sense. This included whether a text is translated (as mentioned above), as well as details such as length and formatting. I won’t go into too much detail, but the combinations that made sense to me were: analyzing Confucius Analects and American Indian Stories together, looking at any of the texts individually, or comparing American Indian Stories against the Journals of Lewis and Clark. I was leaning towards the last option since Native Americans are present in both, but decided against it. To draw a meaningful connection, I would need a lot more knowledge of Native American history, and that sort of project is outside the scope of what I want to achieve. Additionally, the Journals of Lewis and Clark is much longer than the other two texts. When counting the relative frequency of words, that shouldn’t be an issue. However, there may be other problems that I simply can’t foresee at this point.
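To make the point about relative frequency concrete, here is a minimal sketch in Python. It is not part of the project workflow, which used Voyant Tools; the function name `relative_frequencies` and the simple tokenizer are assumptions made for illustration. It expresses each word count per 1,000 words, so a very long text and a short one can be compared on the same scale.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1000):
    """Return each word's count scaled to occurrences per `per` words."""
    # Crude tokenizer (an assumption): lowercase ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count * per / total for word, count in counts.items()}

# A short text and a text ten times longer with the same proportions
# end up with the same relative frequency for each word.
short_text = "mother mother river"
long_text = " ".join([short_text] * 10)
print(relative_frequencies(short_text)["mother"])
print(relative_frequencies(long_text)["mother"])
```

Raw counts would make the longer text look as if it used every word more often; dividing by the total length removes that effect.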
Process
I have now explained the sources and why I chose one specific text: American Indian Stories, a compilation of stories by Zitkala-Ša. As part of the cleanup described earlier, I removed things like the table of contents but kept the titles of the individual stories. In total, the cleanup got rid of just over 10% of the document. Leaving that much irrelevant text in would distort my visualizations. I learned from the project earlier in the term that leaving this extra material in results in the words “project” and “gutenberg” appearing near the top of the most frequent words, which obviously has nothing to do with American Indian Stories.
Voyant Tools was my tool of choice for this project, since we used it earlier in the term. Once I upload my corpus, I can remove recurring, semantically meaningless words from within the tool. The large chunks at the beginning and end should be removed before uploading.
The thought process for me was:
First, remove the massive chunks at the start and end that I know for sure carry no meaning. There is no point in uploading material that is already easy to isolate; I can just trim it beforehand. Then, after analyzing the text in Voyant Tools, remove the remaining semantically meaningless words case by case as they show up.
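The first step could also be done with a short script instead of by hand. The sketch below is only an illustration under stated assumptions: it assumes the file contains marker lines of the form “*** START OF THE PROJECT GUTENBERG EBOOK ... ***” and a matching END line, whose exact wording varies between files and should be checked by eye. The function names are hypothetical. Removing the table of contents would still be a manual step.

```python
import re

# Assumed marker wording; Project Gutenberg files differ, so verify per file.
START = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)

def strip_gutenberg_boilerplate(raw):
    """Keep only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untrimmed.
    """
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()

def share_removed(raw, cleaned):
    """Fraction of characters that the cleanup removed."""
    if not raw:
        return 0.0
    return 1 - len(cleaned) / len(raw)
```

Comparing `share_removed` against a hand-trimmed copy is one way to check that the script and the manual cleanup agree on roughly how much was cut.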
I was quite pleased to get something that looks more representative of the text and that I can be confident does not have “filler” or text artifacts from the original .txt file.
Significance and Limitations
The words “mother” and “women”, as well as other terms relating to family, feature prominently in these texts. Without the necessary background information, we can’t instantly assume that this means family is important to Native Americans. There are a lot of questions I ask myself here. Maybe someone who is more knowledgeable on the topic would be able to draw that conclusion. I don’t feel qualified to assume that these themes have any significance beyond the source itself.
So based on this immediate analysis, I can confidently say that family is an important theme in these texts. Even if I know from background knowledge that family is really important to Native Americans, this source alone wouldn’t be enough to draw that conclusion.
That is one of the major limitations I feel I have run into when doing something like this. I may have the correct conclusion, but the given information may not be enough to justify claiming it.
This leads into the discussion of what a digital humanist is and how to be a responsible one, by not misrepresenting data or drawing weak/faulty conclusions, whether intentional or not.
The embedded example with mother and father is actually not the full picture. I could pass it off as such with no issues if I wanted to, but I should explain that the asterisks are there because the term “father*” includes “grandfather” and “mother*” includes “grandmother”. Someone could use the tools given to them in ways that quietly hide information like this.
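A small sketch of how a wildcard term folds several words into one count. This uses Python's `fnmatch` module, not Voyant's own matcher, so which forms a given pattern catches in Voyant (and where the asterisk has to sit) may differ and should be checked against its documentation. The word list and the pattern `*mother*` are made up for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase

def count_matches(words, pattern):
    """Count each distinct word that matches a shell-style wildcard pattern."""
    return Counter(word for word in words if fnmatchcase(word, pattern))

# Made-up sample, not taken from the corpus.
words = "mother grandmother mother father grandfather mothers".split()

exact = count_matches(words, "mother")        # only the exact word
wildcard = count_matches(words, "*mother*")   # every word containing "mother"

print(sum(exact.values()), dict(exact))
print(sum(wildcard.values()), dict(wildcard))
```

A chart that shows only the total for the wildcard term hides the breakdown, which is why it matters to say which forms were counted together.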
Final Thoughts
This project helped me piece together what a Digital Humanities project would feel like. It isn’t just something I can put together quickly. I have to consider the implications of my decisions, what is reasonable for me to say, and which tools I can use. I want to be credible and, more importantly, avoid misrepresenting things, intentionally or not. Even if I am only one person, major mistakes I make can trickle down, sort of like a cascade of misinformation caused by a small oversight.