For R: How to exclude some data files based on file language












I'm rather new to R (and in law school so this is all very new to me), so apologies if this is poorly worded.
I have a series of about 1500 documents that I am importing into R to categorize and analyze later. The first thing I need to do is exclude all documents written in French, which are labelled with "FR" in the title/doc.info. What code could I use to exclude them before importing the files, so that I have a clean data set before analyzing anything (since French text will obviously make a mess of processes like sentiment analysis)?
Any help is appreciated (even if that help is explaining how to better talk about coding).
Kind regards!



edit 1
The code that I am using is readtext(folder), which you can see below:

library(readtext)

folder <- "C:/[pathway]"
submissions <- readtext(folder)

submissions_text <- submissions$text

# Empty vectors to collect the fields encoded in each file name
submission_number <- numeric()
submission_person <- factor()
submission_code <- factor()
submission_language <- factor()
submission_location <- factor()

for (submission_name in submissions$doc_id) {
  # Drop the literal ".txt" extension, then split the name into its fields
  submission_name <- gsub(".txt", "", submission_name, fixed = TRUE)
  number <- as.numeric(strsplit(submission_name, "_|-")[[1]][1])
  submission_number <- c(submission_number, number)
  person <- strsplit(submission_name, "_")[[1]][2]
  submission_person <- c(submission_person, person)
  code <- strsplit(submission_name, "_")[[1]][3]
  submission_code <- c(submission_code, code)
  lang <- strsplit(submission_name, "_")[[1]][4]
  submission_language <- c(submission_language, lang)
  location <- strsplit(submission_name, "_")[[1]][5]
  submission_location <- c(submission_location, location)
}

# Attach the parsed fields as new columns
submissions <- cbind(submissions, submission_number)
submissions <- cbind(submissions, submission_person)
submissions <- cbind(submissions, submission_code)
submissions <- cbind(submissions, submission_language)
submissions <- cbind(submissions, submission_location)

# Sort by submission number
submissions <- submissions[order(submissions$submission_number, decreasing = FALSE), ]


This is just the organizational aspect of my code. I am looking to hopefully exclude all of the French data before this point (but if it comes afterward, I would also be more than happy with that).
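If filtering after import is acceptable, the file-name fields can be parsed in one vectorized step and the French rows dropped with ordinary subsetting. The sketch below is only an illustration: it uses a small made-up data frame standing in for the readtext() result (the doc_id and text columns mirror the code above, the rows are invented), assumes the language code is the fourth "_"-separated field, and introduces a hypothetical nth() helper.

```r
# Stand-in for the data frame returned by readtext(); rows are invented
submissions <- data.frame(
  doc_id = c("2_tremblay_B2_FR_montreal.txt",
             "10_smith_A1_EN_ottawa.txt",
             "1_lee_C3_EN_halifax.txt"),
  text = c("Bonjour", "Hello", "Hi"),
  stringsAsFactors = FALSE
)

# Remove the extension, then split every name on "_" in one call
stems <- sub("\\.txt$", "", submissions$doc_id)
fields <- strsplit(stems, "_", fixed = TRUE)

# Hypothetical helper: the nth field of every name (NA where a name is too short)
nth <- function(n) vapply(fields, function(p) p[n], character(1))

submissions$submission_number   <- as.numeric(nth(1))
submissions$submission_person   <- nth(2)
submissions$submission_code     <- nth(3)
submissions$submission_language <- nth(4)
submissions$submission_location <- nth(5)

# Drop French documents (%in% keeps rows whose language field is missing),
# then sort by submission number
submissions <- submissions[!(submissions$submission_language %in% "FR"), ]
submissions <- submissions[order(submissions$submission_number), ]
print(submissions[, c("doc_id", "submission_language")])
```

Assigning whole columns at once avoids growing vectors inside a loop, which matters little for 1500 files but keeps the code shorter.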










  • This is a really broad question - probably too broad to be easily answerable here. Have you attempted to write any code, for example just to import all your documents? SO is not a code-writing service, so it's better to show some attempt (even if you're not confident in it) so that people can point you in better directions.

    – phalteman
    Jan 19 at 0:30











  • Sure. I can show you what I currently have. folder<-"C:/[path]" submissions<-readtext(folder) submissions<-mutate(submissions, order = 1:182) submissions<-submissions %>% select(order, doc_id:text) #This will be for the first number, the ordering submission_number<- numeric() submission_person<- factor() submission_code<- factor() submission_language<-factor() submission_location<-factor() for (submission_name in submissions$doc_id) { submission_name<-gsub(".txt","",submission_name) [...] it goes on but my character count is too long.

    – televised-god
    Jan 19 at 0:34








  • You can put this as an edit into your question, so that others can give your code a try (though it looks like you've got a good answer already!). Keep in mind that you'll get better help if you include a minimal example.

    – phalteman
    Jan 19 at 0:42











  • My bad. Again, all new to me. Thanks for the patience!

    – televised-god
    Jan 19 at 0:46
















Tags: r

asked Jan 18 at 23:50 by televised-god, edited Jan 19 at 0:51














2 Answers
































The functionality you are after can be found in the list.files() function; its documentation is available in R via ?list.files.



In short, your code will likely end up looking something like this:



setwd("c:/path/to/your/data/here")
# List every file in the working directory
files <- list.files()
# Keep only the names that do not contain "FR"
non_french_files <- files[!grepl("FR", files)]
lapply(non_french_files, function(x) {
  f <- read.csv(x)
  # do stuff with f
})


Note - you could directly leverage the pattern parameter of list.files(), but I chose to do this in two steps in case you wanted to do something else with the French files. It also makes it clearer what each line of code is doing...



...good luck and welcome to R!
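The same idea can be applied before calling readtext(): build the vector of file paths first, drop the French ones, and read only what remains. The sketch below is a minimal base R illustration rather than a tested pipeline. It assumes file names follow the number_person_code_language_location.txt layout implied by the question's parsing loop; the keep_non_french() helper and the example paths are hypothetical, and the commented-out readtext() call assumes that function accepts a vector of file names, which is worth confirming against the installed version.

```r
# Hypothetical helper: keep only paths whose 4th "_"-separated field is not "FR".
# Assumes names like "12_smith_A1_FR_ottawa.txt" (layout taken from the question's loop).
keep_non_french <- function(paths) {
  # Strip directory and extension so only the underscore-delimited name remains
  stems <- tools::file_path_sans_ext(basename(paths))
  # Pull out the 4th field of each name (NA if a name has fewer fields)
  lang <- vapply(strsplit(stems, "_", fixed = TRUE),
                 function(parts) parts[4], character(1))
  # Names without a language field are treated as non-French and kept
  paths[is.na(lang) | lang != "FR"]
}

# Made-up paths; note that the third name contains "FR" inside "FRASER"
example_paths <- c("C:/data/1_smith_A1_EN_ottawa.txt",
                   "C:/data/2_tremblay_B2_FR_montreal.txt",
                   "C:/data/3_FRASER_C3_EN_halifax.txt")
non_french <- keep_non_french(example_paths)
print(basename(non_french))

# With real files (the path is the question's placeholder):
# paths <- list.files("C:/[pathway]", pattern = "\\.txt$", full.names = TRUE)
# submissions <- readtext::readtext(keep_non_french(paths))
```

Matching on the language field rather than on "FR" anywhere in the name avoids dropping an English file whose name merely happens to contain those two letters.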






  • Thanks! I am going to give this a try. I have been using readtext so far (probably should have mentioned that, sorry), but I will try this list.files. Thanks again!

    – televised-god
    Jan 19 at 0:37











  • You can also specify a regex pattern in list.files() and potentially skip having to assign non_french_files.

    – dylanjm
    Jan 19 at 2:11
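For reference, the pattern argument of list.files() is a regular expression that selects the names to return, so it is most convenient when the wanted files share a positive marker. The sketch below assumes, hypothetically, that English files carry an "_EN_" tag in the same position as "_FR_"; the temporary folder and file names are made up for the demonstration.

```r
# Create a throwaway folder with made-up, empty files to demonstrate on
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("1_smith_A1_EN_ottawa.txt",
                "2_tremblay_B2_FR_montreal.txt",
                "3_lee_C3_EN_halifax.txt")
invisible(file.create(file.path(demo_dir, demo_names)))

# pattern is a regex matched against file names; only matching names are returned
english_files <- list.files(demo_dir, pattern = "_EN_.*\\.txt$")
print(english_files)

# pattern selects rather than excludes, so dropping names is simplest as a second step
all_txt <- list.files(demo_dir, pattern = "\\.txt$")
not_french <- all_txt[!grepl("_FR_", all_txt, fixed = TRUE)]
print(not_french)
```

If the non-French files do not share a common tag, the two-step version with grepl() is likely the easier one to get right.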

































Here's an alternative similar to @Chase's:



# set wd
files <- list.files()[!grepl("FR", list.files())]
lapply(files, function(x) read.csv(x)) # reads all at once, might want to save each
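To keep each file's contents rather than just printing them, the result of lapply() can be stored as a list named by file. This is a minimal sketch, assuming the files are CSVs as the read.csv() call implies; the temporary folder and its two small files are invented for the example.

```r
# Set up two tiny CSV files in a throwaway folder (made-up data)
demo_dir <- file.path(tempdir(), "named_list_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(demo_dir, "a_EN.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5), file.path(demo_dir, "b_FR.csv"), row.names = FALSE)

# List once, with full paths so setwd() is not needed
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
paths <- paths[!grepl("FR", basename(paths), fixed = TRUE)]

# Read each remaining file and name the list elements after the files
tables <- lapply(paths, read.csv)
names(tables) <- basename(paths)

print(names(tables))
print(nrow(tables[["a_EN.csv"]]))
```

Each data frame can then be retrieved by file name, for example tables[["a_EN.csv"]].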





  • Thanks for the suggestion, you guys have been great!

    – televised-god
    Jan 19 at 22:20











First answer: answered Jan 19 at 0:21 by Chase













Second answer: answered Jan 19 at 11:01 by NelsonGon







