Importing a fixed-width file into R when the variables are defined in another file
I'm trying to import this data into R.
https://www.cdc.gov/healthyyouth/data/yrbs/data.htm
I know I need the survey package, but these files are odd.
Anyone know what to do?
r sas spss
Hi Chris, the files (*.dat) are fixed-width text files; you have to identify the column positions and specify them. See stackoverflow.com/questions/14383710/read-fixed-width-text-file
– Khaynes
Jan 19 at 3:35
:O there are 216 letter columns in this dataset. Edit: I’m sorry, 427.
– Chris
Jan 19 at 4:43
I get 314 in the sadc_2017_national.dat dataset? See the answer that I posted.
– Khaynes
Jan 19 at 5:35
Ahhh, I see what it is. I was using the national dataset from 2017, but actually I'll use the complete dataset.
– Chris
Jan 19 at 13:49
edited Jan 19 at 12:41 by Khaynes
asked Jan 19 at 3:31 by Chris
2 Answers
To read the data in, you can use the base read.fwf function.
As mentioned in a comment, you can get the concordance from the SPSS syntax: https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2017/2017_sadc_spss_input_program.sps
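If you'd rather not transcribe the positions by hand, the widths can also be scraped from the SPSS syntax programmatically. The sketch below assumes each variable in the DATA LIST block appears on a line of the form `name start-end`; the three sample lines are illustrative, not copied from the real file, so verify the pattern against the actual .sps before relying on it.

```r
# Sketch: derive read.fwf widths from SPSS "DATA LIST"-style lines.
# The lines below are illustrative stand-ins for readLines() on the .sps file.
sps_lines <- c("sitecode 1-5 (A)",
               "sitename 6-55 (A)",
               "year 114-121")

parse_line <- function(x) {
  m <- regmatches(x, regexec("^([A-Za-z0-9_]+) +([0-9]+)-([0-9]+)", x))[[1]]
  list(name = m[2], start = as.integer(m[3]), end = as.integer(m[4]))
}

spec   <- lapply(sps_lines, parse_line)
nms    <- vapply(spec, `[[`, character(1), "name")
widths <- vapply(spec, function(s) s$end - s$start + 1L, integer(1))
# nms:    "sitecode" "sitename" "year"
# widths: 5 50 8
```

With the real file you would replace `sps_lines` with `readLines("2017_sadc_spss_input_program.sps")`, keeping only the lines inside the DATA LIST block; format suffixes like `(A)` and any continuation lines may need extra handling.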
I've used a text editor to quickly obtain the column widths:
vec <- c(5, 50, 50, 8, 8, 3, 10, 8, 8, 8, 3, 3, 3, 3, 3, 8, 8, 8, 8,
3, 3, 1, 1, 8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3)
The corresponding names of each column/variable:
names <- c("sitecode", "sitename", "sitetype", "sitetypenum", "year",
"survyear", "weight", "stratum", "PSU", "record", "age", "sex",
"grade", "race4", "race7", "stheight", "stweight", "bmi", "bmipct",
"qnobese", "qnowt", "q67", "q66", "sexid", "sexid2", "sexpart",
"sexpart2", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15",
"Q16", "Q17", "Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24",
"Q25", "Q26", "Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Q33",
"Q34", "Q35", "Q36", "Q37", "Q38", "Q39", "Q40", "Q41", "Q42",
"Q43", "Q44", "Q45", "Q46", "Q47", "Q48", "Q49", "Q50", "Q51",
"Q52", "Q53", "Q54", "Q55", "Q56", "Q57", "Q58", "Q59", "Q60",
"Q61", "Q62", "Q63", "Q64", "Q65", "Q68", "Q69", "Q70", "Q71",
"Q72", "Q73", "Q74", "Q75", "Q76", "Q77", "Q78", "Q79", "Q80",
"Q81", "Q82", "Q83", "Q84", "Q85", "Q86", "Q87", "Q88", "Q89",
"QN8", "QN9", "QN10", "QN11", "QN12", "QN13", "QN14", "QN15",
"QN16", "QN17", "QN18", "QN19", "QN20", "QN21", "QN22", "QN23",
"QN24", "QN25", "QN26", "QN27", "QN28", "QN29", "QN30", "QN31",
"QN32", "QN33", "QN34", "QN35", "QN36", "QN37", "QN38", "QN39",
"QN40", "QN41", "QN42", "QN43", "QN44", "QN45", "QN46", "QN47",
"QN48", "QN49", "QN50", "QN51", "QN52", "QN53", "QN54", "QN55",
"QN56", "QN57", "QN58", "QN59", "QN60", "QN61", "QN62", "QN63",
"QN64", "QN65", "QN68", "QN69", "QN70", "QN71", "QN72", "QN73",
"QN74", "QN75", "QN76", "QN77", "QN78", "QN79", "QN80", "QN81",
"QN82", "QN83", "QN84", "QN85", "QN86", "QN87", "QN88", "QN89",
"qnfrcig", "qndaycig", "qnfrevp", "qndayevp", "qnfrskl", "qndayskl",
"qnfrcgr", "qndaycgr", "qntb2", "qntb3", "qntb4", "qniudimp",
"qnshparg", "qnothhpl", "qndualbc", "qnbcnone", "qnfr0", "qnfr1",
"qnfr2", "qnfr3", "qnveg0", "qnveg1", "qnveg2", "qnveg3", "qnsoda1",
"qnsoda2", "qnsoda3", "qnmilk1", "qnmilk2", "qnmilk3", "qnbk7day",
"qnpa0day", "qnpa7day", "qndlype", "qnnodnt", "qbikehelmet",
"qdrivemarijuana", "qcelldriving", "qpropertydamage", "qbullyweight",
"qbullygender", "qbullygay", "qchokeself", "qcigschool", "qchewtobschool",
"qalcoholschool", "qtypealcohol", "qhowmarijuana", "qmarijuanaschool",
"qcurrentcocaine", "qcurrentheroin", "qcurrentmeth", "qhallucdrug",
"qprescription30d", "qgenderexp", "qtaughtHIV", "qtaughtsexed",
"qtaughtstd", "qtaughtcondom", "qtaughtbc", "qdietpop", "qcoffeetea",
"qsportsdrink", "qenergydrink", "qsugardrink", "qwater", "qfastfood",
"qfoodallergy", "qwenthungry", "qmusclestrength", "qsunscreenuse",
"qindoortanning", "qsunburn", "qconcentrating", "qcurrentasthma",
"qwheresleep", "qspeakenglish", "qtransgender", "qnbikehelmet",
"qndrivemarijuana", "qncelldriving", "qnpropertydamage", "qnbullyweight",
"qnbullygender", "qnbullygay", "qnchokeself", "qncigschool",
"qnchewtobschool", "qnalcoholschool", "qntypealcohol", "qnhowmarijuana",
"qnmarijuanaschool", "qncurrentcocaine", "qncurrentheroin", "qncurrentmeth",
"qnhallucdrug", "qnprescription30d", "qngenderexp", "qntaughtHIV",
"qntaughtsexed", "qntaughtstd", "qntaughtcondom", "qntaughtbc",
"qndietpop", "qncoffeetea", "qnsportsdrink", "qnspdrk1", "qnspdrk2",
"qnspdrk3", "qnenergydrink", "qnsugardrink", "qnwater", "qnwater1",
"qnwater2", "qnwater3", "qnfastfood", "qnfoodallergy", "qnwenthungry",
"qnmusclestrength", "qnsunscreenuse", "qnindoortanning", "qnsunburn",
"qnconcentrating", "qncurrentasthma", "qnwheresleep", "qnspeakenglish",
"qntransgender")
As mentioned in an earlier comment, we can use the read.fwf function to read in the fixed-width *.dat file (I have saved just a subset ... I expect it will take a while to read the entire file in):
df <- read.fwf(file = "c:/temp/file", widths = vec)
# Rename columns
names(df) <- names
# Inspect the head.
head(df, n=2)
# sitecode sitename sitetype sitetypenum year survyear weight stratum PSU record age sex grade race4 race7
# 1 XX United States (XX) National 3 1991 1 0.2645 12210 5 29890 . . 1 3 4
# 2 XX United States (XX) National 3 1991 1 0.5060 12310 29 29891 . . . . .
# stheight stweight bmi bmipct qnobese qnowt q67 q66 sexid sexid2 sexpart sexpart2 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33
# 1 . . . . . . NA NA . . . . 2 4 NA NA 4 NA NA NA NA 3 NA NA NA NA NA NA NA NA 2 2 1 1 1 NA 2 4
# 2 . . . . . . NA NA . . . . NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 1 1 1 1 1 NA 1 1
# Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Q42 Q43 Q44 Q45 Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 Q57 Q58 Q59 Q60 Q61 Q62 Q63 Q64 Q65 Q68 Q69 Q70 Q71 Q72 Q73 Q74 Q75 Q76 Q77 Q78 Q79 Q80 Q81 Q82 Q83 Q84
# 1 NA NA NA NA NA NA 4 4 3 NA NA NA 5 5 5 1 NA NA NA NA NA 1 NA NA 1 1 5 4 4 3 3 8 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# 2 NA NA NA NA NA NA 6 2 2 NA NA NA 1 1 1 1 NA NA NA NA NA 1 NA NA 1 1 2 2 2 3 3 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# Q85 Q86 Q87 Q88 Q89 QN8 QN9 QN10 QN11 QN12 QN13 QN14 QN15 QN16 QN17 QN18 QN19 QN20 QN21 QN22 QN23 QN24 QN25 QN26 QN27 QN28 QN29 QN30 QN31 QN32 QN33 QN34 QN35 QN36 QN37 QN38 QN39 QN40 QN41 QN42
# 1 NA NA NA NA NA 1 1 . . 1 . . . . 1 . . . . . . . . 2 2 2 2 1 . 1 2 . . . . . . 1 1 1
# 2 NA NA NA NA NA . . . . . . . . . 2 . . . . . . . . 1 1 2 2 1 . 2 . . . . . . . 1 1 1
# QN43 QN44 QN45 QN46 QN47 QN48 QN49 QN50 QN51 QN52 QN53 QN54 QN55 QN56 QN57 QN58 QN59 QN60 QN61 QN62 QN63 QN64 QN65 QN68 QN69 QN70 QN71 QN72 QN73 QN74 QN75 QN76 QN77 QN78 QN79 QN80 QN81 QN82 QN83
# 1 . . . 1 2 1 2 . . . . . 2 . . 1 1 2 2 1 2 2 2 2 . . . . . . . . . . . . . 1 .
# 2 . . . 2 2 2 2 . . . . . 2 . . 1 1 1 2 2 . . . 2 . . . . . . . . . . . . . 1 .
# QN84 QN85 QN86 QN87 QN88 QN89 qnfrcig qndaycig qnfrevp qndayevp qnfrskl qndayskl qnfrcgr qndaycgr qntb2 qntb3 qntb4 qniudimp qnshparg qnothhpl qndualbc qnbcnone qnfr0 qnfr1 qnfr2 qnfr3 qnveg0
# 1 . . . . . . 2 2 . . . . . . . . . . . . . 2 . . . . .
# 2 . . . . . . 2 2 . . . . . . . . . . . . . . . . . . .
# qnveg1 qnveg2 qnveg3 qnsoda1 qnsoda2 qnsoda3 qnmilk1 qnmilk2 qnmilk3 qnbk7day qnpa0day qnpa7day qndlype qnnodnt qbikehelmet qdrivemarijuana qcelldriving qpropertydamage qbullyweight qbullygender
# 1 . . . . . . . . . . . . 1 . 2 NA NA NA NA NA
# 2 . . . . . . . . . . . . 1 . NA NA NA NA NA NA
# qbullygay qchokeself qcigschool qchewtobschool qalcoholschool qtypealcohol qhowmarijuana qmarijuanaschool qcurrentcocaine qcurrentheroin qcurrentmeth qhallucdrug qprescription30d qgenderexp
# 1 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# qtaughtHIV qtaughtsexed qtaughtstd qtaughtcondom qtaughtbc qdietpop qcoffeetea qsportsdrink qenergydrink qsugardrink qwater qfastfood qfoodallergy qwenthungry qmusclestrength qsunscreenuse
# 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
# 2 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 2 NA
# qindoortanning qsunburn qconcentrating qcurrentasthma qwheresleep qspeakenglish qtransgender qnbikehelmet qndrivemarijuana qncelldriving qnpropertydamage qnbullyweight qnbullygender qnbullygay
# 1 NA NA NA NA NA NA NA 1 . . . . . .
# 2 NA NA NA NA NA NA NA . . . . . . .
# qnchokeself qncigschool qnchewtobschool qnalcoholschool qntypealcohol qnhowmarijuana qnmarijuanaschool qncurrentcocaine qncurrentheroin qncurrentmeth qnhallucdrug qnprescription30d qngenderexp
# 1 . . . . . . . 2 . . . . .
# 2 . . . . . . . 2 . . . . .
# qntaughtHIV qntaughtsexed qntaughtstd qntaughtcondom qntaughtbc qndietpop qncoffeetea qnsportsdrink qnspdrk1 qnspdrk2 qnspdrk3 qnenergydrink qnsugardrink qnwater qnwater1 qnwater2 qnwater3
# 1 2 . . . . . . . . . . . . . . . .
# 2 1 . . . . . . . . . . . . . . . .
# qnfastfood qnfoodallergy qnwenthungry qnmusclestrength qnsunscreenuse qnindoortanning qnsunburn qnconcentrating qncurrentasthma qnwheresleep qnspeakenglish qntransgender
# 1 . . . 2 . . . . . . . .
# 2 . . . 2 . . . . . . . .
Note that any character columns may need to be trimmed. Missing values are also coded as ".", so you would likely want to recode these as well.
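A minimal sketch of that cleanup (the toy data frame is illustrative; alternatively, `na.strings = "."` can be passed straight to read.fwf, which forwards it to read.table):

```r
# Trim padding and recode "." as missing in every character column.
clean_fwf <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x <- trimws(x)
    x[x == "."] <- NA
    x
  })
  df
}

# Illustrative toy data in the same shape as the imported file.
toy <- data.frame(sitecode = c("XX   ", "XX "),
                  bmi = c(".", "21.3"),
                  stringsAsFactors = FALSE)
toy <- clean_fwf(toy)
```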
Please tell me there's an automated way to do this and you didn't have to copy-paste. I found that Hadley Wickham posted this dataset on GitHub, but it's only up to 2013: github.com/hadley/yrbss. I may try to repurpose his import script and see if I can use it to import the 2017 data, just out of curiosity. Thanks so much for doing this; I hope it wasn't a total pain!
– Chris
Jan 19 at 13:55
Also, there are both NA's and "."'s in this dataset. Any guesses as to what the "." values represent, as opposed to the genuine NAs?
– Chris
Jan 19 at 20:16
Not 100% sure. It could be that "." means the question was seen but not answered, whereas "" (or NA) means the question was skipped. Otherwise it could be down to the analysis software used and the variable classes, or something similar. I suspect something like the latter, as no column contains both "." and NA, right?
– Khaynes
Jan 19 at 20:57
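For the survey package the question mentions, a design object could then be sketched from the imported data frame. This is an untested outline: it assumes, per the YRBS documentation, that `weight`, `stratum` and `PSU` are the weighting and design variables, and it needs the full file and the survey package to actually run.

```r
library(survey)

# Hypothetical design object; the variable roles are assumed from the
# column names, so verify them against the YRBS user's guide.
des <- svydesign(ids = ~PSU, strata = ~stratum, weights = ~weight,
                 data = df, nest = TRUE)

# Example: a survey-weighted mean, using one of the numeric columns.
svymean(~bmi, des, na.rm = TRUE)
```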
Although I can't answer your question fully, I can get you started. The reason you are unsure what to do is that the data are not formatted in a way you are used to: the data are in an ASCII format. Here's what the website says:
"Note: SAS and SPSS programs need to be used to convert ASCII into SAS and SPSS datasets. How to use the ASCII data varies from one software package to another. Column positions for each variable usually have to be specified. Column positions for each variable can be found in the documentation for each year’s data. Consult your software documentation for more information."
ASCII is just a different way of storing data, like a .csv or other format; it's just not as readable as having it all in columns. You can start by searching how to import ASCII data into R and go from there. Sorry I can't be of more help.
Yea, it looks like they use a lookup-table design to save space, since it's a massive dataset. It looks like one file holds the data in the form of codes, and another file is used to look up what those codes mean.
– Chris
Jan 19 at 3:52
@Chris, you can get the concordance from the spss file and use that, it will take a while as there are so many columns: cdc.gov/healthyyouth/data/yrbs/sadc_2017/…
– Khaynes
Jan 19 at 3:56
But do you know the code to apply it? Or are you saying I have to manually apply each concordance? :O
– Chris
Jan 19 at 4:44
Hadley Wickham wrote some code to load this dataset in, but it's for the 2013 dataset: github.com/hadley/yrbss/blob/master/data-raw/survey.R. I'm trying to figure out how to repurpose it for the new data.
– Chris
Jan 19 at 4:51
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54263846%2fimporting-a-fixed-width-file-into-r-when-the-variables-are-defined-in-another-fi%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
To read the data in you can use the read.fwf
base method.
As mentioned in a comment, you can get the concordance from the SPSS syntax: https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2017/2017_sadc_spss_input_program.sps
I've used a text editor to quickly obtain the column widths:
vec <- c(5, 50, 50, 8, 8, 3, 10, 8, 8, 8, 3, 3, 3, 3, 3, 8, 8, 8, 8,
3, 3, 1, 1, 8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3)
The corresponding names of each column/variable:
names <- c("sitecode", "sitename", "sitetype", "sitetypenum", "year",
"survyear", "weight", "stratum", "PSU", "record", "age", "sex",
"grade", "race4", "race7", "stheight", "stweight", "bmi", "bmipct",
"qnobese", "qnowt", "q67", "q66", "sexid", "sexid2", "sexpart",
"sexpart2", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15",
"Q16", "Q17", "Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24",
"Q25", "Q26", "Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Q33",
"Q34", "Q35", "Q36", "Q37", "Q38", "Q39", "Q40", "Q41", "Q42",
"Q43", "Q44", "Q45", "Q46", "Q47", "Q48", "Q49", "Q50", "Q51",
"Q52", "Q53", "Q54", "Q55", "Q56", "Q57", "Q58", "Q59", "Q60",
"Q61", "Q62", "Q63", "Q64", "Q65", "Q68", "Q69", "Q70", "Q71",
"Q72", "Q73", "Q74", "Q75", "Q76", "Q77", "Q78", "Q79", "Q80",
"Q81", "Q82", "Q83", "Q84", "Q85", "Q86", "Q87", "Q88", "Q89",
"QN8", "QN9", "QN10", "QN11", "QN12", "QN13", "QN14", "QN15",
"QN16", "QN17", "QN18", "QN19", "QN20", "QN21", "QN22", "QN23",
"QN24", "QN25", "QN26", "QN27", "QN28", "QN29", "QN30", "QN31",
"QN32", "QN33", "QN34", "QN35", "QN36", "QN37", "QN38", "QN39",
"QN40", "QN41", "QN42", "QN43", "QN44", "QN45", "QN46", "QN47",
"QN48", "QN49", "QN50", "QN51", "QN52", "QN53", "QN54", "QN55",
"QN56", "QN57", "QN58", "QN59", "QN60", "QN61", "QN62", "QN63",
"QN64", "QN65", "QN68", "QN69", "QN70", "QN71", "QN72", "QN73",
"QN74", "QN75", "QN76", "QN77", "QN78", "QN79", "QN80", "QN81",
"QN82", "QN83", "QN84", "QN85", "QN86", "QN87", "QN88", "QN89",
"qnfrcig", "qndaycig", "qnfrevp", "qndayevp", "qnfrskl", "qndayskl",
"qnfrcgr", "qndaycgr", "qntb2", "qntb3", "qntb4", "qniudimp",
"qnshparg", "qnothhpl", "qndualbc", "qnbcnone", "qnfr0", "qnfr1",
"qnfr2", "qnfr3", "qnveg0", "qnveg1", "qnveg2", "qnveg3", "qnsoda1",
"qnsoda2", "qnsoda3", "qnmilk1", "qnmilk2", "qnmilk3", "qnbk7day",
"qnpa0day", "qnpa7day", "qndlype", "qnnodnt", "qbikehelmet",
"qdrivemarijuana", "qcelldriving", "qpropertydamage", "qbullyweight",
"qbullygender", "qbullygay", "qchokeself", "qcigschool", "qchewtobschool",
"qalcoholschool", "qtypealcohol", "qhowmarijuana", "qmarijuanaschool",
"qcurrentcocaine", "qcurrentheroin", "qcurrentmeth", "qhallucdrug",
"qprescription30d", "qgenderexp", "qtaughtHIV", "qtaughtsexed",
"qtaughtstd", "qtaughtcondom", "qtaughtbc", "qdietpop", "qcoffeetea",
"qsportsdrink", "qenergydrink", "qsugardrink", "qwater", "qfastfood",
"qfoodallergy", "qwenthungry", "qmusclestrength", "qsunscreenuse",
"qindoortanning", "qsunburn", "qconcentrating", "qcurrentasthma",
"qwheresleep", "qspeakenglish", "qtransgender", "qnbikehelmet",
"qndrivemarijuana", "qncelldriving", "qnpropertydamage", "qnbullyweight",
"qnbullygender", "qnbullygay", "qnchokeself", "qncigschool",
"qnchewtobschool", "qnalcoholschool", "qntypealcohol", "qnhowmarijuana",
"qnmarijuanaschool", "qncurrentcocaine", "qncurrentheroin", "qncurrentmeth",
"qnhallucdrug", "qnprescription30d", "qngenderexp", "qntaughtHIV",
"qntaughtsexed", "qntaughtstd", "qntaughtcondom", "qntaughtbc",
"qndietpop", "qncoffeetea", "qnsportsdrink", "qnspdrk1", "qnspdrk2",
"qnspdrk3", "qnenergydrink", "qnsugardrink", "qnwater", "qnwater1",
"qnwater2", "qnwater3", "qnfastfood", "qnfoodallergy", "qnwenthungry",
"qnmusclestrength", "qnsunscreenuse", "qnindoortanning", "qnsunburn",
"qnconcentrating", "qncurrentasthma", "qnwheresleep", "qnspeakenglish",
"qntransgender")
As mentioned in an earlier comment, we can use the read.fwf
method to read the fixed with *.dat file in (I have saved just a subset ... I expect it will take a while to read the entire file in):
df <- read.fwf(file = "c:/temp/file", widths = vec)
# Rename columns
names(df) <- names
# Inspect the head.
head(df, n=2)
# sitecode sitename sitetype sitetypenum year survyear weight stratum PSU record age sex grade race4 race7
# 1 XX United States (XX) National 3 1991 1 0.2645 12210 5 29890 . . 1 3 4
# 2 XX United States (XX) National 3 1991 1 0.5060 12310 29 29891 . . . . .
# stheight stweight bmi bmipct qnobese qnowt q67 q66 sexid sexid2 sexpart sexpart2 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33
# 1 . . . . . . NA NA . . . . 2 4 NA NA 4 NA NA NA NA 3 NA NA NA NA NA NA NA NA 2 2 1 1 1 NA 2 4
# 2 . . . . . . NA NA . . . . NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 1 1 1 1 1 NA 1 1
# Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Q42 Q43 Q44 Q45 Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 Q57 Q58 Q59 Q60 Q61 Q62 Q63 Q64 Q65 Q68 Q69 Q70 Q71 Q72 Q73 Q74 Q75 Q76 Q77 Q78 Q79 Q80 Q81 Q82 Q83 Q84
# 1 NA NA NA NA NA NA 4 4 3 NA NA NA 5 5 5 1 NA NA NA NA NA 1 NA NA 1 1 5 4 4 3 3 8 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# 2 NA NA NA NA NA NA 6 2 2 NA NA NA 1 1 1 1 NA NA NA NA NA 1 NA NA 1 1 2 2 2 3 3 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# Q85 Q86 Q87 Q88 Q89 QN8 QN9 QN10 QN11 QN12 QN13 QN14 QN15 QN16 QN17 QN18 QN19 QN20 QN21 QN22 QN23 QN24 QN25 QN26 QN27 QN28 QN29 QN30 QN31 QN32 QN33 QN34 QN35 QN36 QN37 QN38 QN39 QN40 QN41 QN42
# 1 NA NA NA NA NA 1 1 . . 1 . . . . 1 . . . . . . . . 2 2 2 2 1 . 1 2 . . . . . . 1 1 1
# 2 NA NA NA NA NA . . . . . . . . . 2 . . . . . . . . 1 1 2 2 1 . 2 . . . . . . . 1 1 1
# QN43 QN44 QN45 QN46 QN47 QN48 QN49 QN50 QN51 QN52 QN53 QN54 QN55 QN56 QN57 QN58 QN59 QN60 QN61 QN62 QN63 QN64 QN65 QN68 QN69 QN70 QN71 QN72 QN73 QN74 QN75 QN76 QN77 QN78 QN79 QN80 QN81 QN82 QN83
# 1 . . . 1 2 1 2 . . . . . 2 . . 1 1 2 2 1 2 2 2 2 . . . . . . . . . . . . . 1 .
# 2 . . . 2 2 2 2 . . . . . 2 . . 1 1 1 2 2 . . . 2 . . . . . . . . . . . . . 1 .
# QN84 QN85 QN86 QN87 QN88 QN89 qnfrcig qndaycig qnfrevp qndayevp qnfrskl qndayskl qnfrcgr qndaycgr qntb2 qntb3 qntb4 qniudimp qnshparg qnothhpl qndualbc qnbcnone qnfr0 qnfr1 qnfr2 qnfr3 qnveg0
# 1 . . . . . . 2 2 . . . . . . . . . . . . . 2 . . . . .
# 2 . . . . . . 2 2 . . . . . . . . . . . . . . . . . . .
# qnveg1 qnveg2 qnveg3 qnsoda1 qnsoda2 qnsoda3 qnmilk1 qnmilk2 qnmilk3 qnbk7day qnpa0day qnpa7day qndlype qnnodnt qbikehelmet qdrivemarijuana qcelldriving qpropertydamage qbullyweight qbullygender
# 1 . . . . . . . . . . . . 1 . 2 NA NA NA NA NA
# 2 . . . . . . . . . . . . 1 . NA NA NA NA NA NA
# qbullygay qchokeself qcigschool qchewtobschool qalcoholschool qtypealcohol qhowmarijuana qmarijuanaschool qcurrentcocaine qcurrentheroin qcurrentmeth qhallucdrug qprescription30d qgenderexp
# 1 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# qtaughtHIV qtaughtsexed qtaughtstd qtaughtcondom qtaughtbc qdietpop qcoffeetea qsportsdrink qenergydrink qsugardrink qwater qfastfood qfoodallergy qwenthungry qmusclestrength qsunscreenuse
# 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
# 2 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 2 NA
# qindoortanning qsunburn qconcentrating qcurrentasthma qwheresleep qspeakenglish qtransgender qnbikehelmet qndrivemarijuana qncelldriving qnpropertydamage qnbullyweight qnbullygender qnbullygay
# 1 NA NA NA NA NA NA NA 1 . . . . . .
# 2 NA NA NA NA NA NA NA . . . . . . .
# qnchokeself qncigschool qnchewtobschool qnalcoholschool qntypealcohol qnhowmarijuana qnmarijuanaschool qncurrentcocaine qncurrentheroin qncurrentmeth qnhallucdrug qnprescription30d qngenderexp
# 1 . . . . . . . 2 . . . . .
# 2 . . . . . . . 2 . . . . .
# qntaughtHIV qntaughtsexed qntaughtstd qntaughtcondom qntaughtbc qndietpop qncoffeetea qnsportsdrink qnspdrk1 qnspdrk2 qnspdrk3 qnenergydrink qnsugardrink qnwater qnwater1 qnwater2 qnwater3
# 1 2 . . . . . . . . . . . . . . . .
# 2 1 . . . . . . . . . . . . . . . .
# qnfastfood qnfoodallergy qnwenthungry qnmusclestrength qnsunscreenuse qnindoortanning qnsunburn qnconcentrating qncurrentasthma qnwheresleep qnspeakenglish qntransgender
# 1 . . . 2 . . . . . . . .
# 2 . . . 2 . . . . . . . .
Note that any character columns may need to be trimmed. Missing's are also "." So you would likely want to remove these as well.
Please tell me there’s an automated way to do this, and you have to copy paste. I found that Hadley Wickham posted this dataset on GitHub, but it’s only up to 2013. github.com/hadley/yrbss I may try to repurpose his import script for this data and see if I can use it to import the 2017 data, just out of curiosity. Thanks so much for doing this.. I hope it wasn’t a total pain!
– Chris
Jan 19 at 13:55
Also, there are both NA’s and “."’s in this dataset. Any guesses to what the “." represent as opposed to the genuine NAs?
– Chris
Jan 19 at 20:16
Not 100%. It could be that "." ='s seen the question but not answered, whereas "" (or NA) could have skipped the question. Otherwise it could be the analysis software used and the variable classes or something similar. I think something like the latter as no columns have "." and NA's, right?
– Khaynes
Jan 19 at 20:57
add a comment |
To read the data in you can use the read.fwf
base method.
As mentioned in a comment, you can get the concordance from the SPSS syntax: https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2017/2017_sadc_spss_input_program.sps
I've used a text editor to quickly obtain the column widths:
vec <- c(5, 50, 50, 8, 8, 3, 10, 8, 8, 8, 3, 3, 3, 3, 3, 8, 8, 8, 8,
3, 3, 1, 1, 8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3)
The corresponding names of each column/variable:
names <- c("sitecode", "sitename", "sitetype", "sitetypenum", "year",
"survyear", "weight", "stratum", "PSU", "record", "age", "sex",
"grade", "race4", "race7", "stheight", "stweight", "bmi", "bmipct",
"qnobese", "qnowt", "q67", "q66", "sexid", "sexid2", "sexpart",
"sexpart2", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15",
"Q16", "Q17", "Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24",
"Q25", "Q26", "Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Q33",
"Q34", "Q35", "Q36", "Q37", "Q38", "Q39", "Q40", "Q41", "Q42",
"Q43", "Q44", "Q45", "Q46", "Q47", "Q48", "Q49", "Q50", "Q51",
"Q52", "Q53", "Q54", "Q55", "Q56", "Q57", "Q58", "Q59", "Q60",
"Q61", "Q62", "Q63", "Q64", "Q65", "Q68", "Q69", "Q70", "Q71",
"Q72", "Q73", "Q74", "Q75", "Q76", "Q77", "Q78", "Q79", "Q80",
"Q81", "Q82", "Q83", "Q84", "Q85", "Q86", "Q87", "Q88", "Q89",
"QN8", "QN9", "QN10", "QN11", "QN12", "QN13", "QN14", "QN15",
"QN16", "QN17", "QN18", "QN19", "QN20", "QN21", "QN22", "QN23",
"QN24", "QN25", "QN26", "QN27", "QN28", "QN29", "QN30", "QN31",
"QN32", "QN33", "QN34", "QN35", "QN36", "QN37", "QN38", "QN39",
"QN40", "QN41", "QN42", "QN43", "QN44", "QN45", "QN46", "QN47",
"QN48", "QN49", "QN50", "QN51", "QN52", "QN53", "QN54", "QN55",
"QN56", "QN57", "QN58", "QN59", "QN60", "QN61", "QN62", "QN63",
"QN64", "QN65", "QN68", "QN69", "QN70", "QN71", "QN72", "QN73",
"QN74", "QN75", "QN76", "QN77", "QN78", "QN79", "QN80", "QN81",
"QN82", "QN83", "QN84", "QN85", "QN86", "QN87", "QN88", "QN89",
"qnfrcig", "qndaycig", "qnfrevp", "qndayevp", "qnfrskl", "qndayskl",
"qnfrcgr", "qndaycgr", "qntb2", "qntb3", "qntb4", "qniudimp",
"qnshparg", "qnothhpl", "qndualbc", "qnbcnone", "qnfr0", "qnfr1",
"qnfr2", "qnfr3", "qnveg0", "qnveg1", "qnveg2", "qnveg3", "qnsoda1",
"qnsoda2", "qnsoda3", "qnmilk1", "qnmilk2", "qnmilk3", "qnbk7day",
"qnpa0day", "qnpa7day", "qndlype", "qnnodnt", "qbikehelmet",
"qdrivemarijuana", "qcelldriving", "qpropertydamage", "qbullyweight",
"qbullygender", "qbullygay", "qchokeself", "qcigschool", "qchewtobschool",
"qalcoholschool", "qtypealcohol", "qhowmarijuana", "qmarijuanaschool",
"qcurrentcocaine", "qcurrentheroin", "qcurrentmeth", "qhallucdrug",
"qprescription30d", "qgenderexp", "qtaughtHIV", "qtaughtsexed",
"qtaughtstd", "qtaughtcondom", "qtaughtbc", "qdietpop", "qcoffeetea",
"qsportsdrink", "qenergydrink", "qsugardrink", "qwater", "qfastfood",
"qfoodallergy", "qwenthungry", "qmusclestrength", "qsunscreenuse",
"qindoortanning", "qsunburn", "qconcentrating", "qcurrentasthma",
"qwheresleep", "qspeakenglish", "qtransgender", "qnbikehelmet",
"qndrivemarijuana", "qncelldriving", "qnpropertydamage", "qnbullyweight",
"qnbullygender", "qnbullygay", "qnchokeself", "qncigschool",
"qnchewtobschool", "qnalcoholschool", "qntypealcohol", "qnhowmarijuana",
"qnmarijuanaschool", "qncurrentcocaine", "qncurrentheroin", "qncurrentmeth",
"qnhallucdrug", "qnprescription30d", "qngenderexp", "qntaughtHIV",
"qntaughtsexed", "qntaughtstd", "qntaughtcondom", "qntaughtbc",
"qndietpop", "qncoffeetea", "qnsportsdrink", "qnspdrk1", "qnspdrk2",
"qnspdrk3", "qnenergydrink", "qnsugardrink", "qnwater", "qnwater1",
"qnwater2", "qnwater3", "qnfastfood", "qnfoodallergy", "qnwenthungry",
"qnmusclestrength", "qnsunscreenuse", "qnindoortanning", "qnsunburn",
"qnconcentrating", "qncurrentasthma", "qnwheresleep", "qnspeakenglish",
"qntransgender")
As mentioned in an earlier comment, we can use the read.fwf
method to read the fixed-width *.dat file in (I have saved just a subset of it; I expect it will take a while to read the entire file):
df <- read.fwf(file = "c:/temp/file", widths = vec)
# Rename columns
names(df) <- names
# Inspect the head.
head(df, n=2)
# sitecode sitename sitetype sitetypenum year survyear weight stratum PSU record age sex grade race4 race7
# 1 XX United States (XX) National 3 1991 1 0.2645 12210 5 29890 . . 1 3 4
# 2 XX United States (XX) National 3 1991 1 0.5060 12310 29 29891 . . . . .
# stheight stweight bmi bmipct qnobese qnowt q67 q66 sexid sexid2 sexpart sexpart2 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33
# 1 . . . . . . NA NA . . . . 2 4 NA NA 4 NA NA NA NA 3 NA NA NA NA NA NA NA NA 2 2 1 1 1 NA 2 4
# 2 . . . . . . NA NA . . . . NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 1 1 1 1 1 NA 1 1
# Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Q42 Q43 Q44 Q45 Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 Q57 Q58 Q59 Q60 Q61 Q62 Q63 Q64 Q65 Q68 Q69 Q70 Q71 Q72 Q73 Q74 Q75 Q76 Q77 Q78 Q79 Q80 Q81 Q82 Q83 Q84
# 1 NA NA NA NA NA NA 4 4 3 NA NA NA 5 5 5 1 NA NA NA NA NA 1 NA NA 1 1 5 4 4 3 3 8 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# 2 NA NA NA NA NA NA 6 2 2 NA NA NA 1 1 1 1 NA NA NA NA NA 1 NA NA 1 1 2 2 2 3 3 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# Q85 Q86 Q87 Q88 Q89 QN8 QN9 QN10 QN11 QN12 QN13 QN14 QN15 QN16 QN17 QN18 QN19 QN20 QN21 QN22 QN23 QN24 QN25 QN26 QN27 QN28 QN29 QN30 QN31 QN32 QN33 QN34 QN35 QN36 QN37 QN38 QN39 QN40 QN41 QN42
# 1 NA NA NA NA NA 1 1 . . 1 . . . . 1 . . . . . . . . 2 2 2 2 1 . 1 2 . . . . . . 1 1 1
# 2 NA NA NA NA NA . . . . . . . . . 2 . . . . . . . . 1 1 2 2 1 . 2 . . . . . . . 1 1 1
# QN43 QN44 QN45 QN46 QN47 QN48 QN49 QN50 QN51 QN52 QN53 QN54 QN55 QN56 QN57 QN58 QN59 QN60 QN61 QN62 QN63 QN64 QN65 QN68 QN69 QN70 QN71 QN72 QN73 QN74 QN75 QN76 QN77 QN78 QN79 QN80 QN81 QN82 QN83
# 1 . . . 1 2 1 2 . . . . . 2 . . 1 1 2 2 1 2 2 2 2 . . . . . . . . . . . . . 1 .
# 2 . . . 2 2 2 2 . . . . . 2 . . 1 1 1 2 2 . . . 2 . . . . . . . . . . . . . 1 .
# QN84 QN85 QN86 QN87 QN88 QN89 qnfrcig qndaycig qnfrevp qndayevp qnfrskl qndayskl qnfrcgr qndaycgr qntb2 qntb3 qntb4 qniudimp qnshparg qnothhpl qndualbc qnbcnone qnfr0 qnfr1 qnfr2 qnfr3 qnveg0
# 1 . . . . . . 2 2 . . . . . . . . . . . . . 2 . . . . .
# 2 . . . . . . 2 2 . . . . . . . . . . . . . . . . . . .
# qnveg1 qnveg2 qnveg3 qnsoda1 qnsoda2 qnsoda3 qnmilk1 qnmilk2 qnmilk3 qnbk7day qnpa0day qnpa7day qndlype qnnodnt qbikehelmet qdrivemarijuana qcelldriving qpropertydamage qbullyweight qbullygender
# 1 . . . . . . . . . . . . 1 . 2 NA NA NA NA NA
# 2 . . . . . . . . . . . . 1 . NA NA NA NA NA NA
# qbullygay qchokeself qcigschool qchewtobschool qalcoholschool qtypealcohol qhowmarijuana qmarijuanaschool qcurrentcocaine qcurrentheroin qcurrentmeth qhallucdrug qprescription30d qgenderexp
# 1 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# qtaughtHIV qtaughtsexed qtaughtstd qtaughtcondom qtaughtbc qdietpop qcoffeetea qsportsdrink qenergydrink qsugardrink qwater qfastfood qfoodallergy qwenthungry qmusclestrength qsunscreenuse
# 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
# 2 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 2 NA
# qindoortanning qsunburn qconcentrating qcurrentasthma qwheresleep qspeakenglish qtransgender qnbikehelmet qndrivemarijuana qncelldriving qnpropertydamage qnbullyweight qnbullygender qnbullygay
# 1 NA NA NA NA NA NA NA 1 . . . . . .
# 2 NA NA NA NA NA NA NA . . . . . . .
# qnchokeself qncigschool qnchewtobschool qnalcoholschool qntypealcohol qnhowmarijuana qnmarijuanaschool qncurrentcocaine qncurrentheroin qncurrentmeth qnhallucdrug qnprescription30d qngenderexp
# 1 . . . . . . . 2 . . . . .
# 2 . . . . . . . 2 . . . . .
# qntaughtHIV qntaughtsexed qntaughtstd qntaughtcondom qntaughtbc qndietpop qncoffeetea qnsportsdrink qnspdrk1 qnspdrk2 qnspdrk3 qnenergydrink qnsugardrink qnwater qnwater1 qnwater2 qnwater3
# 1 2 . . . . . . . . . . . . . . . .
# 2 1 . . . . . . . . . . . . . . . .
# qnfastfood qnfoodallergy qnwenthungry qnmusclestrength qnsunscreenuse qnindoortanning qnsunburn qnconcentrating qncurrentasthma qnwheresleep qnspeakenglish qntransgender
# 1 . . . 2 . . . . . . . .
# 2 . . . 2 . . . . . . . .
Note that character columns may need to be trimmed, since the fixed-width fields are padded with spaces. Missing values are also coded as ".", so you will likely want to convert these to NA as well.
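A sketch of that cleanup on a toy data frame (the column values below are made up to mimic the output above; `read.fwf()`'s `na.strings` and `strip.white` arguments, passed through to `read.table()`, should achieve much the same at read time):

```r
# Toy stand-in for the imported data frame; values are padded/coded the
# way the fixed-width file produces them.
df <- data.frame(
  sitename = c("United States (XX)   ", "Alabama (AL)         "),
  Q8       = c("2", "."),
  stringsAsFactors = FALSE
)

chr <- vapply(df, is.character, logical(1))
df[chr] <- lapply(df[chr], trimws)  # strip the fixed-width space padding
df[df == "."] <- NA                 # "." is the file's missing-value code
```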
Please tell me there’s an automated way to do this, and you didn’t have to copy-paste all of that. I found that Hadley Wickham posted this dataset on GitHub, but it’s only up to 2013: github.com/hadley/yrbss I may try to repurpose his import script for this data and see if I can use it to import the 2017 data, just out of curiosity. Thanks so much for doing this.. I hope it wasn’t a total pain!
– Chris
Jan 19 at 13:55
Also, there are both NA’s and “.”’s in this dataset. Any guesses as to what the “.” represents as opposed to the genuine NAs?
– Chris
Jan 19 at 20:16
Not 100%. It could be that "." ='s seen the question but not answered, whereas "" (or NA) could have skipped the question. Otherwise it could be the analysis software used and the variable classes or something similar. I think something like the latter as no columns have "." and NA's, right?
– Khaynes
Jan 19 at 20:57
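Regarding the survey package mentioned in the question: once the data are read and cleaned, the design columns visible in the output above (weight, stratum, PSU) can be fed to `svydesign()`. A hedged sketch with made-up values — confirm the variable roles against the CDC's SADC documentation before relying on it:

```r
library(survey)

# Toy stand-in for the imported data: weight/stratum/PSU mirror the design
# columns in the real file, but the values here are invented.
df <- data.frame(weight  = c(0.2645, 0.5060, 0.3100, 0.4200),
                 stratum = c(12210, 12210, 12310, 12310),
                 PSU     = c(5, 6, 29, 30),
                 qnobese = c(1, 2, 1, 2))

des <- svydesign(ids = ~PSU, strata = ~stratum, weights = ~weight,
                 data = df, nest = TRUE)
svymean(~qnobese, des)  # design-weighted estimate
```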
edited Jan 19 at 5:39
answered Jan 19 at 5:34
– Khaynes
Although I can't answer your question fully, I can get you started. The reason you are unsure what to do is that the data are not formatted in a way you are used to. The data are in an ASCII format. Here's what the website says:
"Note: SAS and SPSS programs need to be used to convert ASCII into SAS and SPSS datasets. How to use the ASCII data varies from one software package to another. Column positions for each variable usually have to be specified. Column positions for each variable can be found in the documentation for each year’s data. Consult your software documentation for more information."
ASCII is just a different way of storing data, like a .csv or other format, but it's not as readable because the columns aren't delimited. You can start by searching for how to import ASCII data into R and go from there. Sorry I can't be of more help.
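To make "column positions define the fields" concrete, here is a tiny self-contained illustration (a made-up two-column layout, not the YRBS layout):

```r
# Two fixed-width fields per line: a state code (2 chars), then a year (4 chars).
tmp <- tempfile()
writeLines(c("AL2017", "TX2016"), tmp)

# widths tells read.fwf where one field ends and the next begins.
d <- read.fwf(tmp, widths = c(2, 4), col.names = c("state", "year"))
```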
Yea, it looks like they use a lookup-table design to save space, since it’s a massive dataset. One file holds the data in the form of codes, and another file is used to look up what those codes mean.
– Chris
Jan 19 at 3:52
@Chris, you can get the concordance from the spss file and use that, it will take a while as there are so many columns: cdc.gov/healthyyouth/data/yrbs/sadc_2017/…
– Khaynes
Jan 19 at 3:56
but do you know the code to apply it? or are you saying i have to manually apply each concord :OOOOOO
– Chris
Jan 19 at 4:44
Hadley Wickham wrote some code to load this dataset in, but it’s for the 2013 dataset: github.com/hadley/yrbss/blob/master/data-raw/survey.R I’m trying to figure out how to repurpose it for the new data
– Chris
Jan 19 at 4:51
answered Jan 19 at 3:40
– benso8
133
Yea, it looks like they use a Lookup Table design to save space since it’s a massive dataset. It looks like one file is the data in the form of codes, then uses another file to look up what those codes mean.
– Chris
Jan 19 at 3:52
@Chris, you can get the concordance from the spss file and use that, it will take a while as there are so many columns: cdc.gov/healthyyouth/data/yrbs/sadc_2017/…
– Khaynes
Jan 19 at 3:56
but do you know the code to apply it? or are you saying i have to manually apply each concord :OOOOOO
– Chris
Jan 19 at 4:44
Hadley Wickham wrote some code to load this dataset in, but it's for the 2013 dataset: github.com/hadley/yrbss/blob/master/data-raw/survey.R I'm trying to figure out how to repurpose it for the new data
– Chris
Jan 19 at 4:51
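Khaynes's suggestion above (pulling the concordance out of the SPSS syntax file rather than typing 300+ positions by hand) could be sketched roughly as follows. This assumes the syntax file lists each variable as `name start-end`, which needs to be verified against the actual CDC file; the `spec_lines` here are stand-ins for lines read from that file.

```r
# Hedged sketch: assumes the SPSS syntax file's DATA LIST block contains
# lines of the form "name start-end". Inspect the real file first; these
# spec_lines are invented stand-ins, not the actual CDC layout.
spec_lines <- c("sitecode 1-5",
                "year 12-15",
                "q8 97-97")

# Pull out (name, start, end) from each line with a regex
m <- regmatches(spec_lines,
                regexec("^(\\w+)\\s+(\\d+)-(\\d+)", spec_lines))
concord <- do.call(rbind, lapply(m, function(x)
  data.frame(name  = x[2],
             start = as.integer(x[3]),
             end   = as.integer(x[4]),
             stringsAsFactors = FALSE)))
print(concord)
```

The resulting table of start/end positions can then be handed to `readr::read_fwf()` via `fwf_positions(concord$start, concord$end, concord$name)` instead of applying each variable manually.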