Replace NAs with their respective column means in a very large text file
I have a large text file: 400k rows and 10k columns of numeric data, with values 0, 1, or 2. The file is 5-10 GB. It contains a few missing values (NAs), and I want to replace each NA with the mean of its column, i.e. an NA in column 'x' must be replaced by the mean of column 'x'. These are the steps I want to perform:
- Compute the mean of each column of my text file (excluding the header row and starting from the 7th column)
- Replace the NAs in each column with that column's mean
- Write the modified data back out as a text file
Data subset:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1
# total 400k rows and 10k columns
Desired Output:
# Assuming only the 5 rows given in the example above.
# Mean of column X1 = (0 + 1 + 2 + 2) / 4 = 1.25
# Mean of column X2 = (2 + 2) / 2 = 2
# Mean of column X3 = (0 + 0 + 1) / 3 ≈ 0.33
# Mean of column X4 = no NAs, so no replacements
# Replacing NAs with respective means:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 2 0 1
2346 2346 0 0 2 -9 1 2 0.33 1
1334 1334 0 0 2 -9 2 2 0 2
4566 4566 0 0 2 -9 2 2 0.33 0
4567 4567 0 0 1 -9 1.25 2 1 1
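The arithmetic above can be sanity-checked with a short awk one-liner over just the data columns (a minimal sketch; the sample values are hard-coded):

```shell
# Recompute the sample means of X1..X4, excluding NAs from both sum and count.
printf '%s\n' "0 NA 0 1" "1 2 NA 1" "2 NA 0 2" "2 2 NA 0" "NA NA 1 1" |
awk '{ for (i = 1; i <= NF; i++)
           if ($i != "NA") { s[i] += $i; n[i]++ } }
     END { for (i = 1; i <= NF; i++) printf "X%d %.4g\n", i, s[i] / n[i] }'
# X1 1.25, X2 2, X3 0.3333, X4 1
```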
I tried this:
file="path/to/data.txt"
# get total number of columns
number_cols=$(awk '{print NF; exit}' "$file")
# one full awk pass per column -- this is why it is so slow on 10k columns
for ((i = 7; i <= number_cols; i++))
do
    echo "$i"
    # mean of column i, skipping the header row and NA cells
    # (i must be passed with -v: $i inside single quotes is awk's own i, not the shell's)
    mean+=("$(awk -v i="$i" 'NR > 1 && $i != "NA" { total += $i; n++ } END { if (n) print total / n }' "$file")")
done
# array of column means
echo "${mean[@]}"
# find and replace -- still broken: sed would write the same string into every
# column, and "newstr" is only a placeholder for the per-column mean
# find "$file" -type f -exec sed -i 's/NA/newstr/g' {} \;
However, this code is incomplete, and the loop is very slow because it scans the whole file once per column. Is there a faster way to do this? I tried Python and R, but they were too slow as well. I am open to doing this in any programming language as long as it is fast. Can someone please help me write the script?
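For comparison, the three steps can be expressed as a single awk program that reads the file twice (once to accumulate per-column sums and counts, once to substitute the means) instead of once per column. A minimal sketch, assuming space-separated fields; `sample.txt` and `imputed.txt` are placeholder names, and the heredoc just recreates the sample data above:

```shell
# Build the sample input file from the question.
cat > sample.txt <<'EOF'
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1
EOF

# Pass 1 (NR == FNR, first copy of the file): sums and counts per column,
# skipping the header and NA cells. Pass 2: replace NAs with the means.
awk '
NR == FNR {
    if (FNR > 1)
        for (i = 7; i <= NF; i++)
            if ($i != "NA") { sum[i] += $i; cnt[i]++ }
    next
}
FNR == 1 { print; next }             # header passes through unchanged
{
    for (i = 7; i <= NF; i++)
        if ($i == "NA" && cnt[i] > 0)
            $i = sum[i] / cnt[i]
    print
}' sample.txt sample.txt > imputed.txt

cat imputed.txt
```

The same two-pass idea scales to the full file: each pass is a sequential read, so the cost is two scans of the data rather than 10k.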
Thanks
linux bash shell
Please add your desired output for that sample input to your question.
– Cyrus
Jan 18 at 19:14
Added the desired output, thanks
– user3720537
Jan 18 at 20:02
I would recommend using a file format that is designed for huge files, like HDF5. It would definitely help you get and set the right values in the right places.
– MaxPowers
Jan 18 at 21:10
(1) First, your approach reads the entire file once per column, which is very inefficient. Column-wise operation requires frequent seeks across rows anyway. Try transposing the data in advance so you can access all entries of a given column as one line; then you can hold the column data in memory all at once and access it quickly. (2) Don't use bash, awk, or another scripting language: their array indexing is not efficient in terms of access time. I'd use C or C++, which store array elements at contiguous addresses, making element access very fast.
– tshiono
Jan 19 at 2:50
(3) Please reconsider the purpose and the requirement. Imputing an NA from other values means you are creating untrue data anyway. I don't think you need the whole column for this; I'd just use the value in the previous line or a few lines around it. Depending on the purpose, it might be better to discard rows containing NAs as a data-cleansing step. That decision may require a statistics background and an engineering sense of compromise.
– tshiono
Jan 19 at 2:50
asked Jan 18 at 18:57 by user3720537
edited Jan 18 at 21:54 by Poshi