Recommended (maximum) dataset size

Hello
I am currently using Textada for my bachelor thesis. First of all, thank you for making this possible for me!
Regarding the size of the dataset to be processed, I had originally planned to use data from 163 pages (86’580 words), split into 60% training, 20% validation and 20% test data. I have uploaded the dataset into Textada, but unfortunately the program becomes very slow when editing.
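For context, this is roughly how I plan to do the 60/20/20 split outside of Textada. It is just a sketch in Python; the file name and the one-segment-per-line format are assumptions on my part:

```python
# Sketch of the planned 60/20/20 split. Assumes the transcripts sit in one
# plain-text file with one segment per line; the file name is a placeholder.
import random

with open("transcripts.txt", encoding="utf-8") as f:
    segments = [line.strip() for line in f if line.strip()]

random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(segments)

n = len(segments)
train = segments[: int(0.6 * n)]
val = segments[int(0.6 * n) : int(0.8 * n)]
test = segments[int(0.8 * n) :]

print(len(train), len(val), len(test))
```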
Would there be a way to prevent this, or do I need to use a smaller dataset?
If I do need a smaller dataset, what maximum size would you recommend?
Thanks for your support!

Hey Noel,

glad to hear that textada may be helpful for you 🙂

Regarding the dataset size: across how many documents is your data currently split?

Hey Felix
Thanks for your quick feedback!
At the moment, all the data is stored in just one big document.
Could splitting it into separate documents solve the problem?

Hey Noel,

thanks for the response. What kind of documents are those? For example, would it be feasible to split the large document into smaller parts?

We are currently working on increasing the maximum document size in our editor. Until that feature is implemented, I suggest splitting your large document into multiple smaller ones of roughly 5-10 pages each; that should be fine. Please let me know if that works for you 🙂
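If it helps, a split along those lines could be scripted. This is only a rough sketch, and the file names and the words-per-page estimate (86’580 words / 163 pages ≈ 530) are assumptions rather than anything built into textada:

```python
# Rough sketch: split one large plain-text transcript into smaller documents
# of about 5 pages each without cutting paragraphs apart.
# Assumes ~530 words per page and blank lines between paragraphs;
# all file names are placeholders.
TARGET_WORDS = 530 * 5  # roughly 5 pages per document

with open("interviews.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

docs, current, count = [], [], 0
for para in paragraphs:
    current.append(para)
    count += len(para.split())
    if count >= TARGET_WORDS:   # close this document once ~5 pages are reached
        docs.append("\n\n".join(current))
        current, count = [], 0
if current:                     # keep the remainder as the last document
    docs.append("\n\n".join(current))

for i, doc in enumerate(docs, start=1):
    with open(f"interviews_part_{i:02d}.txt", "w", encoding="utf-8") as out:
        out.write(doc)
```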

Best,
Felix

Hey Felix
I work with interview transcripts, so it should not be a problem to split the dataset into separate documents.
Thanks for your help; I’ll get back to you if I run into another problem. 🙂
All the best
Noel
